=Paper=
{{Paper
|id=Vol-1175/CLEF2009wn-ImageCLEF-Vundavalli2009
|storemode=property
|title=IIIT-H at ImageCLEF Wikipedia MM 2009
|pdfUrl=https://ceur-ws.org/Vol-1175/CLEF2009wn-ImageCLEF-Vundavalli2009.pdf
|volume=Vol-1175
|dblpUrl=https://dblp.org/rec/conf/clef/Srinivasarao09
}}
==IIIT-H at ImageCLEF Wikipedia MM 2009==
Srinivasarao Vundavalli
International Institute of Information Technology, Hyderabad-500032, Andhra Pradesh, India
srinivasarao@research.iiit.ac.in

Abstract

In this paper, we describe the IIIT-H retrieval system used for the ImageCLEF Wikipedia MM task. The system automatically ranks the images most similar to a given textual query. It preprocesses the data set to remove non-informative terms and, for each query, produces a ranked list of the most similar images using textual information only.

Categories and Subject Descriptors

H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries

Keywords

TF-IDF, Information Retrieval, Image Retrieval, Vector Space Model, Boolean Model

1 Introduction

ImageCLEF's WikipediaMM task uses the image collection created and employed by the INEX competition [1]. This collection contains approximately 150,000 images covering diverse topics of interest. The images are associated with unstructured and noisy textual annotations in English. There are 75 topics to be searched against the collection. Each topic consists of textual data (and/or sample images and/or concepts) describing a user's (multimedia) information need. The topics are given as XML files, each containing textual information such as the image title, image concept, image location path, and a short narrative describing the desired results, i.e., what counts as a relevant result and what counts as an irrelevant one. The aim of the task is to find as many relevant images as possible in the Wikipedia image collection for a given textual and/or visual query. Section 2 describes the models we used in our experiments, Section 3 presents the results we obtained, and Section 4 concludes the notes.
2 Models

Our system uses a combination of the Vector Space Model (VSM) of information retrieval and the Boolean model to determine how relevant a given document is to a user's query. The intuition behind the VSM is that the more often a query term appears in a document, relative to the number of documents in the collection that contain the term, the more relevant that document is to the query. The Boolean model is used first to narrow down the set of documents that need to be scored, based on the Boolean logic in the query specification. The score is then calculated with the TF-IDF model [2].

2.1 Term Frequency-Inverse Document Frequency Model

In the TF-IDF model [2], each document in the collection, as well as the query, is represented by a vector whose length is the size of the vocabulary.

tf(t in d) is the term frequency, defined as the number of times term t appears in the currently scored document d. Documents with more occurrences of a given term receive a higher score. The default computation of tf(t in d) in our system is sqrt(frequency).

idf(t) stands for inverse document frequency. This value is correlated with the inverse of docFreq, the number of documents in which term t appears, so rarer terms contribute more to the total score. The default computation of idf(t) in our system is 1 + log(numDocs / (docFreq + 1)).

2.2 Vector Space Model (VSM)

In a Vector Space Model [3, 4], a document is represented as a vector in which each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed; one of the best-known schemes is tf-idf weighting [2]. The score of query q for document d is correlated with the cosine distance or dot product between the document and query vectors: a document whose vector is closer to the query vector is scored higher.
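The tf and idf defaults above, combined with a Boolean pre-filter and a dot-product score, can be sketched as follows. This is a minimal illustration assuming documents are already tokenized into lists of strings; the helper names are our own, and the actual Lucene scoring additionally applies normalization factors not shown here.

```python
import math

def tf(term, doc_tokens):
    # Default tf in the paper: square root of the raw term frequency in the document.
    return math.sqrt(doc_tokens.count(term))

def idf(term, docs):
    # Default idf in the paper: 1 + log(numDocs / (docFreq + 1)).
    doc_freq = sum(1 for d in docs if term in d)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def score(query_tokens, doc_tokens, docs):
    # Boolean model as a pre-filter: documents matching no query term are not scored.
    if not any(t in doc_tokens for t in query_tokens):
        return 0.0
    # Dot product of tf-idf weights over the query terms (simplified VSM score).
    return sum(tf(t, doc_tokens) * idf(t, docs) for t in query_tokens)
```

Ranking a collection for a query then amounts to sorting documents by this score in descending order.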
The scoring we used was the Lucene [5] scoring.

3 Runs and Results

Using the image's filename and the text associated with the image, we submitted one run based on the models described in the previous section. The results for the WikipediaMM task have been computed with the trec_eval tool (version 8.1). The submitted runs have been corrected (where necessary) so as to correspond to valid runs in the correct TREC format. The following corrections have been made:

• The runs comply with the TREC format as specified in the submission guidelines for the task.
• When a topic contains an image example that is part of the WikipediaMM collection, this image is removed from the retrieval results, i.e., we are seeking relevant images that the users are not familiar with (as they are with the images they provided as examples).
• When an image is retrieved more than once for a given topic, only its highest ranking for that topic is kept and the rest are removed (and the ranks in the retrieval results are adjusted accordingly).
• Each of the submitted runs is ensured to have a unique name.

The interpolated recall-precision averages are shown in Figure 1, and the summary statistics for the runs, sorted by MAP, are shown in Table 1.

Run      Modality  Topic Fields  FB/QE  MAP     P@10    P@20    R-prec.  Bpref  Documents retrieved  Relevant documents retrieved
iiithr1  TXT       TITLE         NOFB   0.0186  0.0689  0.0389  0.0246   0.022  618                  98

Table 1: Retrieval Results

4 Conclusion

In this working note, we presented a retrieval system that automatically ranks the images most similar to a given textual query. The system preprocesses the data set to remove non-informative terms and, for each query, finds a ranked list of the most similar images using textual information only. Even though our system did not perform well, we are happy to have participated in the ImageCLEF WikipediaMM task and look forward to the next edition of ImageCLEF.

5 References

1.
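The duplicate-removal correction applied to the runs can be sketched as follows. This is an illustrative sketch only: the function name and the representation of a run as (rank, image_id) pairs are our own assumptions, not the actual correction script used by the task organizers.

```python
def deduplicate_run(results):
    # results: list of (rank, image_id) pairs for one topic.
    # Keep only the first (highest-ranked) occurrence of each image,
    # then renumber the ranks so they remain consecutive.
    seen = set()
    deduped = []
    for _, image_id in sorted(results):
        if image_id not in seen:
            seen.add(image_id)
            deduped.append(image_id)
    return [(rank, image_id) for rank, image_id in enumerate(deduped, start=1)]
```

For example, a run containing the same image at ranks 1 and 3 keeps only the rank-1 occurrence, and the images below it move up one rank.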
http://inex.is.informatik.uni-duisburg.de/2007/mmtrack.html

2. Salton, G. and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing and Management 24 (5): 513-523.

3. Salton, G., Wong, A., and Yang, C. S. (1975). "A Vector Space Model for Automatic Indexing". Communications of the ACM, vol. 18, nr. 11, pages 613-620.

4. Song, F. and Croft, W. B. (1999). A General Language Model for Information Retrieval. Research and Development in Information Retrieval: 279-280.

5. http://lucene.apache.org