=Paper=
{{Paper
|id=Vol-1175/CLEF2009wn-ImageCLEF-VillenaRomanEt2009
|storemode=property
|title=MIRACLE-GSI at ImageCLEFphoto 2009: Comparing Clustering vs. Classification for Result Reranking
|pdfUrl=https://ceur-ws.org/Vol-1175/CLEF2009wn-ImageCLEF-VillenaRomanEt2009.pdf
|volume=Vol-1175
|dblpUrl=https://dblp.org/rec/conf/clef/Villena-RomanLG09
}}
==MIRACLE-GSI at ImageCLEFphoto 2009: Comparing Clustering vs. Classification for Result Reranking==
MIRACLE-GSI at ImageCLEFphoto 2009: Comparing Clustering vs. Classification for Result Reranking

Julio Villena-Román (1,3), Sara Lana-Serrano (2,3), José C. González-Cristóbal (2,3)
(1) Universidad Carlos III de Madrid
(2) Universidad Politécnica de Madrid
(3) DAEDALUS - Data, Decisions and Language, S.A.
jvillena@it.uc3m.es, slana@diatel.upm.es, josecarlos.gonzalez@upm.es

Abstract

This paper describes the participation of the MIRACLE-GSI research consortium in the ImageCLEF 2009 Photo Retrieval task. For this campaign, the main purpose of our experiments was to compare the performance of a "standard" clustering algorithm, based on k-Medoids, against a simpler classification technique that makes use of the cluster assignment provided for a subset of topics by the task organizers. First, a common baseline algorithm was used in all experiments to process the document collection: text extraction, tokenization, conversion to lowercase, filtering, stemming and, finally, indexing and retrieval. This baseline was then combined with the two result-reranking techniques. As expected, results show that either reranking method outperforms a standard non-clustering image search baseline in terms of cluster recall. In addition, using the information from the cluster assignments leads to the best results.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.2 Information Storage; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries. H.2 [Database Management]: H.2.5 Heterogeneous Databases; E.2 [Data Storage Representations].

Keywords

Image retrieval, domain-specific vocabulary, thesaurus, linguistic engineering, information retrieval, indexing, relevance feedback, topic expansion, ImageCLEF Photo Retrieval task, ImageCLEF, CLEF, 2009.

1. Introduction

MIRACLE is a research consortium formed by research groups of three different universities in Madrid (Universidad Politécnica de Madrid, Universidad Autónoma de Madrid and Universidad Carlos III de Madrid) along with DAEDALUS, a small/medium-sized enterprise (SME) founded in 1998 as a spin-off of two of these groups and a leading company in the field of linguistic technologies in Spain. MIRACLE has taken part in CLEF since 2003 in many different tracks and tasks.

The basic goal of the ImageCLEF 2009 Photo Retrieval task [1] was, as in previous campaigns, the following: given a multilingual statement describing a specific user information need, find as many relevant images as possible from a given multilingual document collection containing images and text. However, the task introduced a different approach to evaluation by studying image clustering: the top results for the given topics must contain diverse items representing different subtopics within the results. The rationale is that a search engine that retrieves a diverse yet relevant set of images at the top of a ranked list is more likely to satisfy its users. Participants are provided with a set of topics, which are run on their image search system to produce a ranking whose top 20 results contain as many relevant images as possible while representing the different subtopics within the results. Evaluation is based on two measures: precision at 20 and instance recall at rank 20 (also called S-recall), which calculates the percentage of different clusters represented in the top 20 results. In this campaign, a new data set containing half a million images was used.
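For reference, the following is a minimal sketch in Python of the two measures (not the official assessment tool); the image ids and cluster labels in the example are hypothetical.

```python
# Minimal sketch of the two task measures. `relevant` maps each relevant
# image id to its subtopic (cluster) label; ids/labels are hypothetical.

def precision_at(ranking, relevant, n=20):
    """Fraction of the top-n results that are relevant."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n

def cluster_recall_at(ranking, relevant, n=20):
    """Fraction of all subtopics covered by relevant results in the top n."""
    covered = {relevant[doc] for doc in ranking[:n] if doc in relevant}
    return len(covered) / len(set(relevant.values()))

relevant = {"img1": "A1", "img2": "A1", "img3": "A2", "img4": "A3"}
ranking = ["img1", "img9", "img3", "img2", "img8"]
print(precision_at(ranking, relevant, n=5))       # 0.6
print(cluster_recall_at(ranking, relevant, n=5))  # 0.666... (2 of 3 subtopics)
```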
The MIRACLE team decided to split into two subgroups: MIRACLE-GSI (Grupo de Sistemas Inteligentes, Intelligent Systems Group), in charge of purely textual runs, and MIRACLE-FI (Facultad de Informática, Computer Science Faculty), in charge of visual and mixed runs. This paper reviews the participation of MIRACLE-GSI at ImageCLEFphoto 2009; the participation of the other subgroup is described in an accompanying paper. Our idea for this campaign was to continue our open line of research [2] [3] on clustering techniques applied to result reranking. The main purpose of our experiments was to compare the performance of a "standard" clustering algorithm, based on the k-Medoids algorithm [4], against a simpler classification technique that makes use of the cluster assignment provided for a subset of topics by the task organizers. All experiments were fully automatic, with no manual intervention, and are described in the following sections.

2. Experiments

Based on our experience in previous campaigns, we designed a flexible system able to execute a large number of runs that exhaustively cover many combinations of different techniques. Our system is composed of a set of small components that are easily combined in different configurations and executed sequentially to build the final result set. Specifically, the system is composed of five modules:

• Linguistic processing module, which extracts, parses and prepares the input text for subsequent modules.
• Expander module, which expands documents and/or topics with additional related terms using textual and/or statistical methods.
• Textual (text-based) retrieval module, which indexes image annotations in order to search and find the list of images that are most relevant to the text of the topic.
• Result combination module, which uses OR/AND operators to combine, if necessary, two different result lists.
• Clustering module, which reranks the result list to allow cluster diversity.

[Figure 1. Overview of the system.]

A common baseline algorithm was used in all experiments to process the collection, following these steps (a minimal code sketch follows the list):

1. Text extraction: ad-hoc scripts are run on the files that contain image annotations in XML format.
2. Tokenization: this process extracts basic textual components. Some basic entities are also detected, such as numbers, initials, abbreviations and years. So far, compounds, proper nouns, acronyms and other types of entity are not specifically considered. The outcome of this process consists only of single words, years in numbers and tagged entities.
3. Conversion to lowercase: all document terms are normalized by changing all letters to lowercase.
4. Filtering: all words recognized as stopwords are filtered out. Stopwords in the target languages were initially obtained from the University of Neuchatel's resources page [5] and afterwards extended using our own resources [2].
5. Stemming: this process is applied to each of the words to be indexed or used for retrieval. Standard Porter stemmers [6] for each considered language were used.
6. Indexing and retrieval: Lucene [7] was used as the information retrieval engine for the whole textual indexing and retrieval task.
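The sketch below illustrates steps 2-6 of this chain in a plain-Python setting, under stated simplifications: the stopword set is a tiny placeholder for the Neuchatel lists [5], NLTK's SnowballStemmer stands in for the per-language Porter stemmers [6], and a toy inverted index stands in for Lucene [7].

```python
# Simplified sketch of the baseline processing chain (steps 2-6).
import re
from nltk.stem.snowball import SnowballStemmer

STOPWORDS = {"the", "a", "an", "of", "in", "and"}  # placeholder list [5]
stemmer = SnowballStemmer("english")               # one stemmer per language [6]

def preprocess(text):
    # 2. Tokenization (simplified: single words and numbers only)
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    # 3. Conversion to lowercase
    tokens = [t.lower() for t in tokens]
    # 4. Filtering out stopwords
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 5. Stemming
    return [stemmer.stem(t) for t in tokens]

def build_index(annotations):
    # 6. Toy inverted index (term -> set of image ids), standing in for Lucene [7]
    inverted = {}
    for image_id, text in annotations.items():
        for term in preprocess(text):
            inverted.setdefault(term, set()).add(image_id)
    return inverted
```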
The topic set was divided into two subgroups. The first 25 topics include a cluster assignment provided by the task organizers, i.e., some clues were given to guide the clustering process; for those topics, classification techniques can be used to produce the final result list. The rest of the topics did not include any cluster assignment, so "standard" clustering techniques had to be used to produce the final result list.

On the one hand, the classification technique finds, for each topic, the list of images that are relevant to each given cluster and, in addition, the list of images that are relevant to the topic but do not match any of the given clusters (the "Others" cluster). For that purpose, the algorithm first builds as many subtopics as different clusters have been provided for the given topic. These subtopics contain the original topic terms combined with the terms of the cluster titles. For instance, if topic A has two associated clusters (A1, A2), the set of subtopics would be:

{termsA AND termsA1} and {termsA AND termsA2}

Second, the algorithm builds another subtopic that includes the topic terms but excludes the terms of all clusters. For the previous example, the subtopic for the "Others" cluster would be:

{termsA AND NOT termsA1 AND NOT termsA2}

Then those subtopics are given to the Lucene information retrieval engine to get the relevant list of images. Last, each image is assigned to the cluster that corresponds to the subtopic with which the image has the highest similarity (a query-building sketch is given at the end of this section).

On the other hand, the clustering technique is based on an implementation of the k-Medoids clustering algorithm [4], with k (the target number of clusters) equal to 20 and the maximum number of epochs set to 40. This algorithm is run over a sparse term-document matrix built with the image annotations returned by a textual search over the image index using each topic. For each resulting cluster, the element with the highest relevance in the baseline image result list is selected as the class prototype and reranked to the top of the final result list.
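As announced above, the subtopic queries of the classification technique can be expressed as Lucene boolean query strings. The following is a hedged sketch of how they could be assembled; the helper name and the example topic are hypothetical, and only the AND / AND NOT structure comes from the description above.

```python
# Hedged sketch: build one Lucene-style boolean query per given cluster,
# plus an 'Others' query that keeps the topic terms but excludes the terms
# of every cluster title.

def build_subtopic_queries(topic_terms, cluster_titles):
    queries = {
        name: f"({topic_terms}) AND ({title})"
        for name, title in cluster_titles.items()
    }
    negations = " AND ".join(f"NOT ({t})" for t in cluster_titles.values())
    queries["Others"] = f"({topic_terms}) AND {negations}"
    return queries

# Hypothetical topic A with two clusters A1 and A2:
print(build_subtopic_queries("beach holidays", {"A1": "sunset",
                                                "A2": "palm trees"}))
# {'A1': '(beach holidays) AND (sunset)',
#  'A2': '(beach holidays) AND (palm trees)',
#  'Others': '(beach holidays) AND NOT (sunset) AND NOT (palm trees)'}
```

Each subtopic query is then run against the index, and every image is assigned to the subtopic that scores it highest.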
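Similarly, the sketch below illustrates the k-Medoids reranking step. It assumes scikit-learn's TfidfVectorizer for the sparse term-document matrix and the KMedoids implementation from the scikit-learn-extra package as a stand-in for our own implementation of Park et al.'s algorithm [4]; k = 20 and the 40-epoch limit come from the text, everything else is illustrative.

```python
# Sketch of k-Medoids reranking over a sparse term-document matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn_extra.cluster import KMedoids

def rerank_with_kmedoids(results, annotations, k=20, epochs=40):
    """results: image ids in baseline relevance order (best first);
    annotations: image id -> annotation text."""
    matrix = TfidfVectorizer().fit_transform([annotations[doc] for doc in results])
    labels = KMedoids(n_clusters=min(k, len(results)),
                      max_iter=epochs).fit_predict(matrix)
    # For each cluster, the best-ranked member is taken as its prototype...
    prototypes, seen = [], set()
    for doc, label in zip(results, labels):  # results are in relevance order
        if label not in seen:
            seen.add(label)
            prototypes.append(doc)
    # ...and the prototypes are moved to the top of the final result list.
    return prototypes + [doc for doc in results if doc not in set(prototypes)]
```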
3. Results and Conclusions

Table 1 shows the complete list of submitted runs along with a brief description.

Table 1. Description of experiments

  Run Identifier    Method
  MIRGSI1_T_TXT     no clustering (baseline)
  MIRGSI2_T_TXT     k-Medoids clustering
  MIRGSI_TCT_TXT    classification with topic+cluster titles [only for topics 1-25]

Results are presented in the following tables, showing the run identifier, the number of relevant documents retrieved (RelRet), the mean average precision (MAP), precision at the first 10, 20 and 30 results (P10/P20/P30), and cluster recall at the first 10, 20 and 30 results (CR10/CR20/CR30). Table 2 shows the results for the first 25 topics and Table 3 shows the results for the remaining topics.

Table 2. Results for queries 1-25

  Run             RelRet  MAP    P10    P20    P30    CR10   CR20   CR30
  MIRGSI1_T_TXT   8777    0.484  0.796  0.790  0.799  0.417  0.531  0.600
  MIRGSI2_T_TXT   8777    0.477  0.784  0.750  0.760  0.455  0.617  0.634
  MIRGSI_TCT_TXT  8795    0.481  0.776  0.770  0.773  0.643  0.755  0.782

Table 3. Results for queries 26-50

  Run             RelRet  MAP    P10    P20    P30    CR10   CR20   CR30
  MIRGSI1_T_TXT   9596    0.514  0.760  0.742  0.740  0.510  0.609  0.667
  MIRGSI2_T_TXT   9596    0.502  0.740  0.732  0.745  0.565  0.668  0.682

The following figures show the precision and cluster recall values for each run and allow comparing the results achieved by the classification technique (MIRGSI_TCT_TXT) with those of the clustering technique (MIRGSI2_T_TXT) in each subset of topics. Data series identified with "part 1" refer to topics 1-25, whereas data series identified with "part 2" refer to topics 26-50.

[Figure 2. Precision at N, for all runs.]
[Figure 3. Cluster recall at N, for all runs.]

The baseline experiment achieves the best result in terms of MAP. However, the best cluster recall (CR), which was the variable to maximize in this task, is achieved when the other techniques are used, thus proving them to be valuable. As could be expected, the run that makes use of the manually assigned clusters (MIRGSI_TCT_TXT) achieves the best results in terms of cluster recall and clearly outperforms the baseline experiment (0.782 vs. 0.600 at CR30, a 30% relative improvement). Moreover, k-Medoids clustering is slightly better than the baseline experiment in cluster recall at every cutoff. After this preliminary analysis, the conclusion that can be drawn is that the application of clustering techniques improves the information retrieval process and shows quite promising results.

Acknowledgements

This work has been partially supported by the Spanish R+D National Plan, by means of the project BRAVO (Multilingual and Multimodal Answers Advanced Search – Information Retrieval), TIN2007-67407-C03-03, and by the Madrid R+D Regional Plan, by means of the project MAVIR (Enhancing the Access and the Visibility of Networked Multilingual Information for the Community of Madrid), S-0505/TIC/000267.

References

1. Paramita, M.; Sanderson, M.; Clough, P. Diversity in photo retrieval: overview of the ImageCLEFPhoto task 2009. CLEF working notes 2009, Corfu, Greece, 2009.
2. Lana-Serrano, Sara; Villena-Román, Julio; González-Cristóbal, José Carlos. MIRACLE-GSI at ImageCLEFphoto 2008: Experiments on Semantic and Statistical Topic Expansion. Evaluating Systems for Multilingual and Multimodal Information Access: 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus, Denmark, September 17-19, 2008, Revised Selected Papers. Peters, Carol et al. (Eds.). Lecture Notes in Computer Science, 2008 (printed in 2009).
3. Villena-Román, Julio; Lana-Serrano, Sara; Martínez-Fernández, José Luis; González-Cristóbal, José Carlos. MIRACLE at ImageCLEFphoto 2007: Evaluation of Merging Strategies for Multilingual and Multimedia Information Retrieval. Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, Revised Selected Papers. Peters, Carol et al. (Eds.). Lecture Notes in Computer Science, Vol. 5152, 2008. ISSN: 0302-9743/1611-3349.
4. Park, Hae-Sang; Lee, Jong-Seok; Jun, Chi-Hyuck. A K-means-like Algorithm for K-medoids Clustering and Its Performance. Proceedings of the 36th CIE Conference on Computers & Industrial Engineering, pp. 1222-1231, Taipei, Taiwan, June 20-23, 2006.
5. University of Neuchatel. Page of resources for CLEF (stopwords, transliteration, stemmers, ...). Online: http://www.unine.ch/info/clef [Visited 10/08/2008].
6. Porter, Martin. Snowball stemmers and resources page. Online: http://www.snowball.tartarus.org [Visited 10/08/2008].
7. Apache Lucene project. http://lucene.apache.org [Visited 09/11/2008].