KIDS-NUTN at ImageCLEF 2012 Photo Annotation and Retrieval Task

Been-Chian Chien, Guan-Bin Chen, Li-Ji Gaou, Chia-Wei Ku, Rong-Sing Huang, and Siao-En Wang
Knowledge, Information, and Database System Laboratory
Department of Computer Science and Information Engineering
National University of Tainan, Tainan, Taiwan
bcchien@mail.nutn.edu.tw

Abstract. The task of visual concept detection, annotation, and retrieval using Flickr photos at ImageCLEF 2012 was organized as two subtasks: concept annotation and concept retrieval. In this paper, we present the effort of the KIDS lab for the two subtasks. The proposed approaches combine various visual and textual features, dimension reduction methods, random forest classification models, and a semi-supervised learning strategy. For the concept annotation subtask, the annotation results show that combining tags with visual features outperforms using visual features alone under the same classification model. The results also show that semi-supervised learning is not superior to supervised learning in this subtask, and that using more kinds of visual features does not appear to bring further gains in F-measure. For the concept retrieval task, the results illustrate that textual features carry much richer information than visual features for retrieving general concepts.

Keywords: ImageCLEF, Image Annotation, Image Classification, Image Retrieval, Random Forest

1 Introduction

The ImageCLEF 2012 visual concept detection, annotation, and retrieval task using Flickr photos arranged two subtasks [13]: the concept annotation task and the concept-based retrieval task. The challenge of the first subtask, concept annotation, is to automatically assign to each image a set of concepts taken from a list of 94 pre-defined concepts. This subtask uses a subset of the MIRFLICKR-1M collection with 15,000 training images and 10,000 test images. The second subtask, concept-based retrieval, aims at retrieving target images for 42 concept queries from a subset of the MIRFLICKR collection comprising 200,000 photos. The queries are provided in XML format and contain a title, a description, and three annotated example images taken from the training set of subtask 1. In this paper, we describe the approaches used in the two subtasks, including the extraction of image features, the concept learning models, and the concept retrieval methods.

The annotated concepts in this task cover a wide range of topics, such as natural scenes, kinds of animals, human gender, human emotions, and means of transportation. Some of the concepts are so abstract and semantically ambiguous that even human users cannot annotate the images well. In order to annotate images precisely, we approached the task from the following aspects. First, various visual features are extracted from the images to investigate the correlation between the visual features and the annotation concepts. Second, the high-dimensional and multilingual textual tags need to be analyzed, processed, and reduced to improve the efficiency of image annotation and retrieval on large datasets. Third, effective classification models and efficient learning methods are necessary to integrate the visual and textual features into multi-label classifiers that can annotate numerous concepts. Our concept annotation approaches combine several techniques: image feature extraction, text processing, dimension reduction, random forest classification models, and a semi-supervised learning strategy.
For our evaluation, the validation set consisted of 5,000 images selected from the given 15,000 training images, and the remaining 10,000 images were used for training. After tuning the parameters, the final classification models were learned from all 15,000 training images, and the 10,000 test images were annotated. We also used a modified MBRM [4] method as a baseline to observe and compare the effectiveness of the annotation approaches. The concept retrieval approaches are based on the concept annotation approaches. This paper also discusses the ranking method for retrieving images based on the three given query images in subtask 2.

The rest of this paper is organized as follows. Section 2 describes the extraction and preprocessing of the visual and textual features. The feature reduction method, the concept learning methods, and the classification models are presented in Section 3. Section 4 describes the concept retrieval method. Section 5 shows and discusses the experimental results of the different submission runs. Finally, we draw conclusions in Section 6.

2 Extraction and Preprocessing of Image Features

2.1 Feature Extraction

The original image data set in the Flickr photo task consists of JPEG images, EXIF metadata embedded in the image files, and supplementary tags for each image. The image features are thus extracted from both the JPEG image and the textual part. In this subsection we introduce the extraction of the visual and textual features; the normalization of the visual features is described in the next subsection.

Visual Features. The annotated concepts in the Flickr photo task are very diverse. Although the 94 annotated concepts are categorized into natural elements, environment, people, image elements, and human elements, annotating them visually is still ambiguous, since different persons may view a photo differently. To collect as many visual features as possible from an image, each image was first segmented equally into 16 sub-images (a 4-by-4 grid of blocks). The original image and its corresponding 16 sub-images, 17 images in total, are the sources for generating visual features. Four visual features, AutoColorCorrelogram [5], ColorLayout [1], FCTH [2], and Gabor [11], were extracted from the original image and each of the 16 sub-images, while the Gist feature [12] was applied only to the original image. Each image thus generates a list of multi-dimensional feature vectors.

In addition to the five visual features above, regions of interest (ROIs) in the original images are marked automatically by the visual attention model proposed by Itti et al. [7]. We modified the method by applying a 6-level Gaussian pyramid to generate a saliency map representing the degree of attention in an image. Then, the region growing method [14] was used to mark the appropriate ROIs. Each of the 16 sub-image blocks is marked as foreground if more than 60% of its area is covered by the marked ROIs; otherwise, the block is marked as background. The AutoColorCorrelogram values of the background blocks are then averaged to form the ROI background visual feature. To recognize the number of people in photos, the OpenCV face detection package was used to detect and estimate the number of persons in each photo. The numbers of dimensions of the extracted visual features are summarized in Table 1.
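The block segmentation and foreground/background marking just described can be summarized in a short sketch. The following Python code is a minimal illustration under stated assumptions, not the authors' implementation: the saliency-based ROI mask is assumed to be produced separately by the attention model, and only the 4-by-4 grid and the 60% coverage rule are taken from the text.

```python
import numpy as np

def split_into_blocks(image, grid=4):
    """Split an image (H x W x C array) into grid x grid equal sub-images."""
    h, w = image.shape[:2]
    bh, bw = h // grid, w // grid
    return [image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            for r in range(grid) for c in range(grid)]

def mark_background_blocks(roi_mask, grid=4, coverage=0.6):
    """Return a boolean list: True if a block is background.

    roi_mask: binary H x W array marking the saliency-based ROIs
    (assumed to come from a separate visual attention model).
    """
    h, w = roi_mask.shape
    bh, bw = h // grid, w // grid
    flags = []
    for r in range(grid):
        for c in range(grid):
            block = roi_mask[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]
            # foreground if more than 60% of the block is covered by ROIs
            flags.append(block.mean() <= coverage)
    return flags
```

The per-block AutoColorCorrelogram vectors of the blocks flagged as background would then be averaged to obtain the ROI background feature described above.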
The AutoColorCorrelogram, ColorLayout, FCTH, and Gabor features were extracted with the LIRE (Lucene Image Retrieval) Java library (http://www.semanticmetadata.net/lire). The face detection tool was implemented with OpenCV (http://opencv.org). The ROI marking method and the Gist feature extraction were implemented by ourselves.

Table 1. The visual features used.

Visual feature              Feature dimensions   #Images   Total
AutoColorCorrelogram [5]    1024                 17        17408
ColorLayout [1]             120                  17        2040
FCTH [2]                    192                  17        3264
Gabor [11]                  60                   17        1020
Gist [12]                   192                  1         192
ROI background [7,14]       1024                 1         1024

Textual Features. The textual information for the Flickr photos comes from the EXIF (Exchangeable image file format) metadata and the user-supplied tags of each image. EXIF is a standard that specifies the format of media data, such as images and sounds, produced by digital cameras. The EXIF of each image contains 407 fields in total, but only 24 fields were selected (e.g., black level, blur warning, brightness, compression, contrast, date and time, zoom, exposure, ISO, noise).

The other source of textual features is the tag description file of each image. The tags describe related semantic information of the images. Before applying the tag information to annotate images, two problems have to be solved. First, the tags are multilingual; more than 68 different languages appear in the tag set, so synonymous terms need to be unified. Second, the dimensionality of the tag features must be reduced. To resolve these problems, the Google translation tools (http://translate.google.com) were used to translate the multilingual tags into English, and stop words were then removed from the tag set. The final vocabulary contains 60821 English terms, and the term frequency of each tag was counted and recorded.

Further, to support the detection of humans in photos, the OpenCV face detection package was used to estimate the number of persons in each photo. The estimated number of people ranges from 0 to 13, and the face count of each photo is encoded as 14 binary features. The numbers of final textual features are listed in Table 2.

Table 2. The textual features used.

Textual feature     Original   After extraction
Number of faces     1          14
EXIF                407        24
Tags                69099      60821
Total               69507      60859

2.2 Preprocessing of Visual Features

The visual and textual features extracted in subsection 2.1 vary widely in dimensionality and value range. Furthermore, high dimensionality is a serious obstacle for learners when generating annotation models. To reduce the dimensionality and to combine the extracted visual and textual features in a unified representation, the visual features are processed as follows.

Let I be an image set with n images and I_i be an image in I. Let x denote a specified visual feature and x_i = (x_{i1}, ..., x_{im}) the m-dimensional vector of that feature for the image I_i, where x_{ij} is the value of the jth dimension of x_i, 1 ≤ j ≤ m. We assume that C_1, C_2, ..., C_K are the K possible annotation concepts in the system and |C_k| denotes the number of images belonging to the concept C_k in the image set I. We first calculate the mean vector μ_k and standard deviation vector σ_k of the visual feature for each concept C_k as follows:

    \mu_k = \frac{1}{|C_k|} \sum_{I_i \in C_k} x_i, \qquad \sigma_k = \sqrt{\frac{1}{|C_k|} \sum_{I_i \in C_k} (x_i - \mu_k)^2}, \qquad 1 \le k \le K.    (1)

Then, the concept similarity of the visual feature x_i with respect to the concept C_k is defined as

    y_{ik} = \frac{1}{m} \sum_{j=1}^{m} \exp\!\left(-\left(\frac{x_{ij} - \mu_{kj}}{\sigma_{kj}}\right)^{2}\right), \qquad 1 \le k \le K,    (2)

where μ_{kj} and σ_{kj} are the jth-dimension values of the vectors μ_k and σ_k, respectively, for the image I_i ∈ I, 1 ≤ i ≤ n. Hence, an m-dimensional visual feature x_i of an image I_i is normalized into a K-dimensional feature vector y_i = [y_{ik}] with 0 ≤ y_{ik} ≤ 1, 1 ≤ k ≤ K.
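As an illustration of Equations (1)-(2), the following NumPy sketch computes the concept-similarity normalization for one visual feature. It is a minimal reading of the reconstructed formulas (including the 1/m averaging assumed above), not the authors' original code, and all array and function names are hypothetical.

```python
import numpy as np

def concept_statistics(X, concept_masks):
    """Per-concept mean and standard deviation of one visual feature.

    X: (n, m) array, one m-dimensional feature vector per image.
    concept_masks: (K, n) boolean array; concept_masks[k, i] is True
                   if image i belongs to concept C_k.
    """
    mu = np.stack([X[mask].mean(axis=0) for mask in concept_masks])     # (K, m), Eq. (1)
    sigma = np.stack([X[mask].std(axis=0) for mask in concept_masks])   # (K, m), Eq. (1)
    return mu, sigma

def concept_similarity(X, mu, sigma, eps=1e-12):
    """Transform each m-dimensional feature into K concept similarities, Eq. (2)."""
    # z[i, k, j] = (x_ij - mu_kj) / sigma_kj
    z = (X[:, None, :] - mu[None, :, :]) / (sigma[None, :, :] + eps)
    return np.exp(-z ** 2).mean(axis=2)                                 # (n, K), values in (0, 1]
```

Applying this transformation separately to each of the feature vectors in Table 1 yields the 6580-dimensional normalized visual representation mentioned below.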
Since each multi-dimensional visual feature of an image, as listed in Table 1, is transformed into 94 dimensions (the number of concepts), the total number of visual features after this processing is 6580.

3 Feature Reduction and Concept Learning Models

Although part of the features have already been reduced, the visual and textual features extracted in the previous section still have very high dimensionality (67439 features in total). In general, it is not easy for any classification model to learn effective classifiers efficiently from such high-dimensional datasets. To deal with this, we applied a feature reduction method based on the discriminant coefficient [9, 10] before learning the classifiers. The submitted runs are mainly based on two learning models: the random decision trees [3] and the Multiple Bernoulli Relevance Models (MBRM) [4]. In addition to the supervised learning strategy, a semi-supervised learning strategy is also considered to investigate its feasibility for image annotation. In this section, we briefly present the main methods used in this task, including the feature reduction method, the concept classification models, and the learning strategies.

3.1 Feature Reduction

The reduction method is based on the discriminant coefficient proposed by Lin and Chien [9, 10], in which the discriminant coefficients are calculated from the difference between the statistics of two classes. Before calculating the discriminant coefficients, the image features need to be normalized according to the concept classes. Let y_i be the concept similarity of the visual feature x_i as defined in Section 2.2, and y_{ij} the jth dimension of the transformed concept similarity for a visual feature of the image I_i. For textual features, y_i is the term-frequency vector and y_{ij} is the term frequency of the term j for the image I_i. The normalization of the visual and textual features is defined as

    f_{kj} = \frac{1}{|C_k|} \sum_{I_i \in C_k} y_{ij}, \qquad 1 \le k \le K.    (3)

The normalized features can be denoted as a matrix F = [f_{kj}] of size K × P, where K is the number of concept classes and P is the number of all transformed visual features plus all final extracted textual features. The feature reduction method in [10] first calculates the relative discriminant variables of each feature for all K concept classes. The discriminant variables are then normalized into the log-scaled discriminant coefficient matrix J = [J_{ij}] of size K × P, where each J_{ij} lies between 0 and 1. A large J_{ij} indicates that the jth feature is highly discriminative for the concept C_i; conversely, a small J_{ij} means that the jth feature provides little discernible information for the concept C_i. Let Y = [y_{ij}] be the n × P matrix of visual and textual features, where n is the number of images in the training set. The goal of feature reduction is to find a transformation such that the number of visual and textual features becomes much smaller than the original. The transformation is completed by

    T = Y \cdot J^{t},    (4)

where J^t is the transpose of the matrix J. After the transformation in Equation (4), T is an n × K matrix which replaces Y as the reduced feature representation of the training set for the learning models.
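A compact way to read Equations (3)-(4) is as two matrix operations: class-wise averaging of the features, followed by a projection through the discriminant coefficient matrix. The sketch below is a hypothetical NumPy rendering rather than the authors' implementation; it leaves the computation of the log-scaled coefficients J abstract, since only their use in the projection is described here.

```python
import numpy as np

def class_conditional_means(Y, concept_masks):
    """Eq. (3): F[k, j] = mean of feature j over the images in concept C_k.

    Y: (n, P) matrix of transformed visual + textual features.
    concept_masks: (K, n) boolean membership matrix.
    """
    return np.stack([Y[mask].mean(axis=0) for mask in concept_masks])   # (K, P)

def reduce_features(Y, J):
    """Eq. (4): project the n x P feature matrix onto K dimensions.

    J: (K, P) log-scaled discriminant coefficient matrix (values in [0, 1]),
       computed from F as described in [10]; its derivation is not shown here.
    """
    return Y @ J.T                                                      # (n, K)
```

The same projection J, learned on the training set, would also be applied to the test images before classification so that both sets share the reduced representation.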
3.2 Random Forest

The random decision tree method [3] is an ensemble classifier that first builds a number of decision trees randomly. Each decision tree is constructed by randomly selecting an untested feature as the decision node at each level; the tree structures are therefore completely independent of the training data. After the decision trees are built, the training data are used to update the class statistics at each node of every random decision tree. To classify an unknown example, the predicted class is estimated by voting among the trees or by averaging the class probabilities of all decision trees. The Matlab implementation of Random Forests (http://code.google.com/p/randomforest-matlab/) was used in this task. To deal with the multi-label problem, a binary classifier was learned for each annotation concept. Although 94 classifiers have to be learned in total, the random forest method remains efficient because the reduced matrix T is used as the training set.

3.3 Semi-supervised Learning

In general, classifiers are built by supervised learning from a training set that contains only labeled data. In this paper, we also investigated the feasibility of semi-supervised learning, which uses both labeled and unlabeled data in the learning process; the goal is to exploit the unlabeled data to improve classification effectiveness.

The first step of semi-supervised learning is to learn classifiers from all ground-truth labeled data as in Section 3.2. Next, the unlabeled data are classified and ranked by the voting ratios of the random decision trees. The top 10 positive examples and the top 10 negative examples from the unlabeled data are then added to the training set, and new classifiers are re-trained. This learning process is repeated k times, and the final classification models are used to annotate the concepts of the images.
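The self-training loop of Section 3.3 can be written down compactly. The sketch below uses scikit-learn's RandomForestClassifier as a stand-in for the Matlab random forest package actually used; only the top-10 selection and the k iterations come from the text, while the iteration count, tree count, and all names are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def self_training(X_lab, y_lab, X_unlab, k=5, n_add=10, n_trees=100):
    """Iterative self-training for one binary concept classifier.

    X_lab, y_lab: labeled training data for one concept (y in {0, 1}).
    X_unlab: unlabeled images (e.g., the reduced feature matrix of the test pool).
    k: number of self-training iterations (value not specified in the paper).
    n_add: number of top positive and top negative examples added per iteration.
    """
    X_train, y_train = X_lab.copy(), y_lab.copy()
    pool = X_unlab.copy()
    clf = RandomForestClassifier(n_estimators=n_trees)
    for _ in range(k):
        clf.fit(X_train, y_train)
        if len(pool) < 2 * n_add:
            break
        # voting ratio of the ensemble = predicted probability of the positive class
        votes = clf.predict_proba(pool)[:, 1]
        pos = np.argsort(votes)[-n_add:]          # most confident positives
        neg = np.argsort(votes)[:n_add]           # most confident negatives
        picked = np.concatenate([pos, neg])
        X_train = np.vstack([X_train, pool[picked]])
        y_train = np.concatenate([y_train, np.ones(n_add), np.zeros(n_add)])
        pool = np.delete(pool, picked, axis=0)    # remove the pseudo-labeled examples
    clf.fit(X_train, y_train)
    return clf
```

One such classifier would be trained per concept, mirroring the 94 binary classifiers of the supervised setting.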
 (5)  p 0 q 0 We assume that the distribution of the training set is an uniform distribution, the probability PT(Ii) = 1/n, where n is the number of images in the training set I. The Sim(fp, f′q) stands for the similarity degree between the features fp and f′q, and the P(w|Ii) is defined as following equation:   N ( w , I i )  N ( w ,I ) P(w | I i )  , (6) n where N(w, Ii) denotes the number of times the annotation w occurs in the image Ii, N(w,I) denotes the number of times the annotation w occurs in the training set I, and  is the smooth parameter. 4 The Concept Retrieval Method The concept retrieval task gave 42 queries containing a concept title, text description, and three images for each query. The query images are all in the training set and the test database comprises 20,000 photos selected from the MIRFLICKR collection. The approaches of retrieving the images with the same concept as the query are based on the concept annotation approaches in the subtask 1. 8 We first analyzed the concept ratios of the three images for each query. Next, we applied the concept annotation approaches used in the subtask 1 to annotate images in the test database. Finally, the images were ranked by the concept ratios and the voting ratios of random decision trees. Formally, we assume that the three query images are annotated by a few concepts individually. Let ij be the voting ratio of the jth concept on the image Ii , and wj be the ratio of the concept Ci annotated by the three images, 1  j  K, where K is the number of concepts. The similarity degree of the image Ii for the query Q are defined as K Sim(Q, I i )    w . j 1 ij j (7) 5 Experimental Results and Discussions 5.1 The Concept Annotation Task First, we would like to introduce the methods for the runs we submitted. In concept annotation subtask, we totally submitted five runs which based on two methods. One is the approach based on the feature reduction and random decision trees described in Section 3.1 to 3.3. The other is based the Multiple Bernoulli Relevance Models (MBRM) described in Section 3.4. The run_a1, run_a2, and run_a3 use the former method. Especially, the run_a2 applies the semi-supervised learning instead of super- vised-learning. The run_a4 and run_a5 take the latter one. The run_a4 used annotation scores to weight the probabilities of words in images. The run_a5 only considered binary annotations. The used features and methods are summarized in Table 3. Table 3. The features and methods used in the submission runs. Features run_a1 run_a2 run_a3 run_a4 run_a5 AutoColorCorrelogram ○ ○ ○ ColorLayout ○ ○ ○ ○ ○ Visual FCTH ○ ○ ○ features Gabor ○ ○ ○ Gist ○ ○ ○ ROI ○ ○ ○ Face detection ○ ○ ○ Textual EXIF ○ ○ ○ features Tags ○ Classification Random Forest ○ ○ ○ models MBRM ○ ○ Semi-supervised learning ○ Learning Feature reduction ○ ○ ○ methods Weighting features ○ 9 As Table 3 shows, the used features in run_a1 and run_a2 are the same including visual features and part of the EXIF metadata. Except for the visual and EXIF fea- tures, the run_a3 employed tags as a part of textual features. Since the MBRM method is time consuming and the time complexity is dependent on the number of features, the run_a4 and run_a5 only considered the visual feature ColorLayout. The two runs are used to be the baselines while comparing with the other runs having multi-features. The results of evaluation are shown in Table 4. 
5 Experimental Results and Discussions

5.1 The Concept Annotation Task

First, we introduce the methods behind the submitted runs. For the concept annotation subtask, we submitted five runs in total, based on two methods. One is the approach based on feature reduction and random decision trees described in Sections 3.1 to 3.3; the other is based on the Multiple Bernoulli Relevance Models (MBRM) described in Section 3.4. The runs run_a1, run_a2, and run_a3 use the former method; in particular, run_a2 applies semi-supervised instead of supervised learning. The runs run_a4 and run_a5 use the latter method: run_a4 uses annotation scores to weight the probabilities of words in images, while run_a5 considers binary annotations only. The features and methods used are summarized in Table 3.

Table 3. The features and methods used in the submission runs.

Features / Methods             run_a1   run_a2   run_a3   run_a4   run_a5
Visual features
  AutoColorCorrelogram           ○        ○        ○
  ColorLayout                    ○        ○        ○        ○        ○
  FCTH                           ○        ○        ○
  Gabor                          ○        ○        ○
  Gist                           ○        ○        ○
  ROI                            ○        ○        ○
Textual features
  Face detection                 ○        ○        ○
  EXIF                           ○        ○        ○
  Tags                                             ○
Classification models
  Random Forest                  ○        ○        ○
  MBRM                                                      ○        ○
Learning methods
  Semi-supervised learning                ○
  Feature reduction              ○        ○        ○
  Weighting features                                        ○

As Table 3 shows, run_a1 and run_a2 use the same features, namely the visual features and part of the EXIF metadata. In addition to the visual and EXIF features, run_a3 employs the tags as part of the textual features. Since the MBRM method is time consuming and its time complexity depends on the number of features, run_a4 and run_a5 consider only the ColorLayout visual feature; these two runs serve as baselines against the runs that use multiple features. The evaluation results are shown in Table 4.

The MiAP and GMiAP measures are low because our methods are concerned only with binary annotation of each image. The confidence scores in the submission runs were produced by the voting ratios of the ensemble of random decision trees, and a voting ratio of 0.5 was used as the threshold for annotating an image with a concept. However, the voting ratio generally cannot serve as the annotation confidence score of an image. In fact, we think that neither MiAP nor GMiAP is an appropriate metric for this subtask.

Table 4. The results of the concept annotation task.

Measures    run_a1   run_a2   run_a3   run_a4   run_a5
MiAP        0.1022   0.1018   0.1717   0.0947   0.0985
GMiAP       0.0470   0.0472   0.0984   0.0495   0.0537
Precision   0.6257   0.5860   0.6313   0.6339   0.6414
Recall      0.2588   0.2153   0.3384   0.2422   0.2385
F-ex        0.3662   0.3149   0.4406   0.3505   0.3478

Table 5. The results of F1-measure for different concept categories.

Concept categories   run_a1   run_a2   run_a3   run_a4   run_a5
time of day          0.2751   0.2828   0.3447   0.1419   0.1065
celestial bodies     0.0000   0.0000   0.0767   0.0000   0.0000
weather              0.1043   0.1043   0.1154   0.0000   0.0000
combustion           0.0000   0.0000   0.0000   0.0000   0.0000
lighting effects     0.0052   0.0039   0.0423   0.0000   0.0000
scenery              0.0156   0.0175   0.1376   0.0000   0.0000
water                0.0029   0.0029   0.0587   0.0000   0.0000
flora                0.1064   0.1111   0.2656   0.0000   0.0000
fauna                0.0000   0.0024   0.2557   0.0000   0.0000
quantity             0.6847   0.5905   0.7362   0.6902   0.6902
age                  0.1239   0.0994   0.4011   0.0425   0.0385
gender               0.0622   0.0598   0.3423   0.0035   0.0081
relationship         0.0000   0.0022   0.0574   0.0000   0.0000
quality              0.6707   0.5959   0.6902   0.6622   0.6619
style                0.0000   0.0074   0.0618   0.0000   0.0000
view                 0.1829   0.1931   0.3428   0.0964   0.0933
type                 0.0093   0.0043   0.1920   0.0000   0.0000
impression           0.0144   0.0137   0.1048   0.0003   0.0003
transportation       0.0027   0.0000   0.1509   0.0000   0.0000

Table 4 shows that all runs have high precision and low recall. The two baseline methods, MBRM and weighted MBRM, used only one visual feature to classify the concepts; run_a4, which applies weighting scores, improves little over run_a5. The results of supervised learning on random trees (run_a1) are better than those of semi-supervised learning (run_a2) in this task. Furthermore, run_a3, which uses supervised learning, visual features, and tags, achieves the best results. The F1-measure results for the different concept categories are shown in Table 5. From the results, some remarkable characteristics are discussed as follows:

- The tag information does improve the performance of automatic image annotation under essentially every measure.
- All our approaches have higher precision and lower recall. For run_a1, run_a2, and run_a3, the features kept by the discriminant-coefficient feature reduction method are highly discriminative, while the less discriminative features are eliminated; this should be the main reason for the high precision and low recall of this kind of method. For run_a4 and run_a5, the high threshold chosen for the probability P(I_j, w) might be the cause of the low recall of the MBRM method.
- Semi-supervised learning did not outperform supervised learning in this subtask. The reason might lie in ranking by the voting ratio of the random trees: as mentioned above, the voting ratio cannot reflect the annotation confidence of an image, so the classified test images were not ranked and added to the training set correctly.
- Generally, the concept categories with high annotation rates, such as quality and quantity, have many more positive examples and obvious visual characteristics in both the training set and the test set.
- The concept categories with very low annotation rates, such as combustion and relationship, usually have few examples, or are abstract or highly dependent on semantics. It is difficult for image analysis to find a general model for such special visual concepts.

5.2 The Concept Retrieval Task

The results of concept retrieval are listed in Table 6. The three submission runs, run_r1, run_r2, and run_r3, are based on the annotation methods used in run_a1, run_a2, and run_a3, respectively. Since the concept retrieval methods build on the annotation results of subtask 1, their performance depends heavily on the effectiveness of those annotations. It is clear that run_r3 obtains the best results because of the higher annotation quality of run_a3, while the other runs achieve low precision. The results also show that the tags are an important factor in retrieving relevant images: using visual features only may not retrieve the correct concepts from a large collection of general images, and the semantics inside images still need appropriate textual annotation.

Table 6. The results of the concept retrieval task.

Measures   run_r1   run_r2   run_r3
MnAP       0.0009   0.0007   0.0313
AP@10      0.0003   0.0006   0.0051
AP@20      0.0010   0.0014   0.0077
AP@100     0.0096   0.0081   0.0729

6 Conclusion

This was the first time our lab participated in the photo annotation task. Since many abstract concepts cannot be described by general visual features, designing effective visual features for representing various concepts is important for annotating images precisely. In this paper we presented annotation methods based on discriminant-coefficient feature reduction and the random decision trees model. All the visual and textual features used can be extracted from images easily and generally. The best result was obtained by the model combining general visual features and tags. Combining various visual features does not seem to improve performance much over a single visual feature. We also found that different visual features usually work well for specific concepts; the performance could be improved if the appropriate visual features were selected and used for each specific concept.

After the submission, some further analyses of the general visual features and the learning strategies were made. Specialized feature extraction models are necessary for learning classifiers that annotate such concepts correctly. For example, visual concepts such as shadow and reflection in the lighting effects category could be marked or modeled as specific regions or representations. We believe that representative features, effective feature selection methods, and machine learning models will be the key to annotating specific concepts, although there is probably no single way for general visual features to detect all concepts effectively.

Acknowledgements. This research was supported in part by the National Science Council of Taiwan, R.O.C., under contract NSC101-2221-E-024-026.

References

1. Chang, S. F., Sikora, T., Puri, A.: Overview of the MPEG-7 standard. IEEE Transactions on Circuits and Systems for Video Technology, pp. 688-695 (2001)
2. Chatzichristofis, S. A., Boutalis, Y. S.: FCTH: Fuzzy color and texture histogram, a low level feature for accurate image retrieval.
   In: 9th International Workshop on Image Analysis for Multimedia Interactive Services, Klagenfurt, Austria, pp. 191-196 (2008)
3. Chen, C., Liaw, A., Breiman, L.: Using random forest to learn imbalanced data. Technical Report no. 666, Department of Statistics, University of California, Berkeley (2004)
4. Feng, S. L., Manmatha, R., Lavrenko, V.: Multiple Bernoulli relevance models for image and video annotation. In: IEEE International Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, pp. 1002-1009 (2004)
5. Huang, J., Kumar, S. R., Mitra, M., Zhu, W. J., Zabih, R.: Image indexing using color correlograms. In: International Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, pp. 762-768 (1997)
6. Huiskes, M. J., Lew, M. S.: The MIR Flickr retrieval evaluation. In: ACM International Conference on Multimedia Information Retrieval (MIR'08), Vancouver, Canada (2008)
7. Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, pp. 1254-1259 (1998)
8. Lienhart, R., Maydt, J.: An extended set of Haar-like features for rapid object detection. In: IEEE ICIP 2002, vol. 1, pp. 900-903 (2002)
9. Lin, Y. X., Chien, B. C.: A discriminant based document analysis for text classification. In: International Computer Symposium, Workshop of Artificial Intelligence, Knowledge Discovery, and Fuzzy Systems, Dec. 16-18, 2010, Tainan, Taiwan, pp. 594-599 (2010)
10. Lin, Y. X., Chien, B. C.: Efficient feature reduction for high-precision text classification. In: National Computer Symposium on Databases, Data Mining, and Information Retrieval, Chia-Yi, Taiwan (2011)
11. Manjunath, B. S., Ma, W. Y.: Texture features for browsing and retrieval of large image data. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837-842 (1996)
12. Siagian, C., Itti, L.: Rapid biologically-inspired scene classification using features shared with visual attention. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 2, pp. 300-312 (2007)
13. Thomee, B., Popescu, A.: Overview of the ImageCLEF 2012 Flickr Photo Annotation and Retrieval Task. In: CLEF 2012 Working Notes, Rome, Italy (2012)
14. Zhai, Y., Shah, M.: Visual attention detection in video sequences using spatiotemporal cues. In: 14th Annual ACM International Conference on Multimedia, pp. 815-824 (2006)