<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IPL at CLEF 2012 Medical Retrieval Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Spyridon Stathopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Sakiotis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Theodore Kalamboukis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Athens University of Economics and Business</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article presents an experimental evaluation of Latent Semantic Analysis (LSA) for searching very large image databases. It also describes IPL's participation in the ImageCLEF 2012 ad-hoc textual and visual retrieval for the medical task. We report on our approaches and methods and present the results of extensive experiments applying data fusion to the results obtained from LSA on several low-level visual features.</p>
      </abstract>
      <kwd-group>
        <kwd>LSA</kwd>
        <kwd>LSI</kwd>
        <kwd>CBIR</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The continuous advances of the internet and digital technologies, as well as the
rapidly increasing multimedia content used by modern information systems, have
imposed a need for efficient systems for organizing and retrieving content from
large multimedia collections. However, image retrieval is still far from
effective, for reasons of computational cost, scalability and retrieval
performance.</p>
      <p>
        In our runs this year we have experimented with Latent Semantic Analysis
(LSA), a technique that, although it has been used successfully in many
applications in the domain of text retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], has not experienced similar success
in CBIR. The main reason is the density of the features-by-images
matrix, C, generated in image retrieval, in contrast to textual retrieval, where the
term-by-document matrix is sparse. As a result, the cost of the SVD rises
to prohibitively high levels in both space and computation time.
      </p>
      <p>In this article we give an overview of the application of our methods to ad-hoc
medical retrieval and present the results of our submitted runs. Our efforts this
year were concentrated on applying the LSA method to a number of low-level
visual features and then using data fusion techniques on the SVD-transformed
low-rank approximation of images to enhance retrieval. We explore what we
call an SVD-bypass technique, which factors the feature matrix by solving the much
smaller eigenproblem of the term-correlation matrix CC^T instead of computing the
SVD of the matrix C. This method proved to be a much more efficient and scalable
solution for large data sets.</p>
      <p>In the next section we describe our approach, and in the following sections
we present IPL's submitted runs on textual and visual retrieval with their
corresponding descriptions and results. In the last section we conclude
with remarks on this work and propositions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>Visual Retrieval</title>
      <p>According to the traditional use of LSA in information retrieval, a
term-by-document matrix, C, is first constructed and an SVD analysis is then performed
on this matrix. However, as stated before, the feature matrix in the case of image
retrieval is a dense matrix. This raises the computational cost of the SVD
analysis to prohibitive levels for large image databases. A typical example from
our database this year: the color layout feature produces a matrix of size
11288 x 305000, about 30 GB in double precision, which makes the SVD impossible
to solve with our computer resources. In our LSA implementation we instead solve the
eigenproblem of the feature-correlation matrix CC^T. For a suitable
representation of the images, this matrix is of moderate size, demanding less
storage, and its eigenvalue problem can be solved much faster than the
SVD factorization of the matrix C. We then approximate the feature matrix
using only the k largest eigenvalues and corresponding eigenvectors of the matrix
CC^T, for a suitable value of k.</p>
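      <p>As a minimal numerical sketch of this SVD bypass (a tiny random matrix stands in for the real feature data), the eigen-decomposition of the small m x m matrix CC^T yields the same latent subspace as the far costlier SVD of C, since the eigenvalues of CC^T are the squared singular values of C:</p>
      <p>

```python
import numpy as np

# Tiny stand-in for the real features-by-images matrix (the actual one is
# 11288 x 305000, far too large for a direct SVD on modest hardware).
rng = np.random.default_rng(0)
m, n = 50, 400
C = rng.standard_normal((m, n))

# Solve the m x m eigenproblem of CC^T instead of the SVD of C.
S = C @ C.T
eigvals, eigvecs = np.linalg.eigh(S)           # returned in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 10
Uk = eigvecs[:, :k]                            # top-k eigenvectors

# Cross-check against the SVD of C: eigenvalues of CC^T are the squared
# singular values of C, and Uk spans the same subspace as the top-k left
# singular vectors (up to sign).
U, s, Vt = np.linalg.svd(C, full_matrices=False)
assert np.allclose(eigvals[:k], s[:k] ** 2)

# Project an image's original feature vector y into the k-dimensional space.
y = C[:, 0]
yk = Uk.T @ y
```

      </p>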
      <sec id="sec-2-1">
        <title>Preprocessing of the data</title>
        <p>It is well known that the representation of a digital image depends on several
factors, from its resolution to its color model. In a collection of images there are
likely to be significant variations in these characteristics.
Thus, each image undergoes several transformations before the feature
extraction step. In our case we applied the following transformations:
1. Size normalization: all images are re-scaled to the same size.
2. Transformation to gray-scale.
3. Tile splitting: each image is split into equal-sized, non-overlapping cells we
refer to as tiles.</p>
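        <p>The three preprocessing steps can be sketched as follows (a minimal illustration; the target size, the nearest-neighbour re-scaling, and the channel-averaging gray-scale conversion are illustrative choices, not necessarily those of our actual pipeline):</p>
        <p>

```python
import numpy as np

def preprocess(img, size=256, grid=8):
    """Re-scale to size x size, convert to gray-scale, split into grid x grid tiles."""
    # 1. Size normalization via nearest-neighbour index sampling.
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    img = img[rows][:, cols]
    # 2. Gray-scale conversion (average the channels if the image is RGB).
    if img.ndim == 3:
        img = img.mean(axis=2)
    # 3. Tile splitting into equal-sized, non-overlapping cells ("tiles").
    t = size // grid
    return [img[r*t:(r+1)*t, c*t:(c+1)*t]
            for r in range(grid) for c in range(grid)]

tiles = preprocess(np.zeros((300, 500, 3)))    # an 8 x 8 grid gives 64 tiles
assert len(tiles) == 64 and tiles[0].shape == (32, 32)
```

      </p>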
      </sec>
      <sec id="sec-2-2">
        <title>Feature Extraction and Selection</title>
        <p>
          The vector representation of the images was based on three MPEG-7 low-level
features: Scalable Color (SC) with 64 coefficients per tile, Color Layout (CL)
with 192 coefficients per tile, and the Edge Histogram (EH) feature.
Experiments on the CLEF 2011 image collection showed that extracting the edge
histogram per tile had a negative impact on retrieval performance, so this
feature was extracted from the whole image instead. All features were extracted
using the Java library Caliph&amp;Emir of the Lire CBIR system [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
        <p>Finally, a simple histogram with 32 gray levels was extracted from
each tile. To increase the discriminating power of the histogram, we remove the
levels with high frequency and normalize the remaining histogram values for all
images. In addition, all histogram levels with a total frequency above 80%
are considered stop-words and are removed. We refer to this feature
as the Color Selection Histogram (CSH).</p>
      </sec>
      <sec id="sec-2-2-1">
        <title>Construction of the feature-correlation matrix CC^T</title>
        <p>As we have already mentioned, the matrix C is dense in the case of CBIR, and so
is the matrix CC^T. This matrix multiplication is the most computationally intensive
and memory-demanding part of the computation. In our implementation we overcome
these problems by splitting the matrix C into a number of blocks,
C = (C1, C2, ..., Cp), such that each block can be accommodated in memory, and
calculating CC^T by</p>
        <p>CC^T = C1 C1^T + C2 C2^T + ... + Cp Cp^T</p>
        <p>After solving the eigenproblem of the feature-correlation matrix CC^T, the
k largest eigenvectors, say Uk, and the corresponding eigenvalues are selected.
The original feature vector y of an image is then projected into the k-dimensional
space using the transformation</p>
        <p>yk = Uk^T y</p>
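        <p>The blockwise accumulation can be sketched as follows (a minimal illustration with a small random matrix; in practice each block Ci would be read from disk in turn so that only one block is resident in memory at a time):</p>
        <p>

```python
import numpy as np

def corr_matrix_blockwise(blocks):
    """Accumulate CC^T = C1 C1^T + C2 C2^T + ... + Cp Cp^T one block at a time,
    so only a single column block of C needs to be in memory at once."""
    m = blocks[0].shape[0]
    S = np.zeros((m, m))
    for Ci in blocks:
        S += Ci @ Ci.T
    return S

rng = np.random.default_rng(1)
C = rng.standard_normal((6, 100))
blocks = np.array_split(C, 4, axis=1)      # C = (C1, C2, C3, C4)
S = corr_matrix_blockwise(blocks)
assert np.allclose(S, C @ C.T)             # identical to the direct product
```

        </p>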
      </sec>
      <sec id="sec-2-3">
        <title>Data Fusion</title>
        <p>For the data fusion task we used a weighted linear combination of the results
obtained by applying the LSA method to different features, defined by</p>
        <p>SCORE(Q, Image) = sum_i wi * scorei(Q, Image)</p>
        <p>
          where scorei denotes the similarity score of an Image with respect to feature i.
The weight of each feature type is determined as a function of its performance [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
The wi's were estimated as the square of the Mean Average Precision (MAP)
values attained by the corresponding feature on the CLEF '11 collection [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
These values are listed in Table 1.
        </p>
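        <p>A minimal sketch of this fusion rule follows; the MAP values and per-feature scores below are invented placeholders, not the figures from Table 1:</p>
        <p>

```python
# Hypothetical per-feature MAP values on the training collection.
maps = {"SC": 0.031, "CL": 0.028, "CSH": 0.020}
weights = {f: m ** 2 for f, m in maps.items()}      # w_i = MAP_i^2

def fuse(scores_per_feature, weights):
    """SCORE(Q, img) = sum_i w_i * score_i(Q, img), ranked best-first."""
    fused = {}
    for feat, scores in scores_per_feature.items():
        for img, s in scores.items():
            fused[img] = fused.get(img, 0.0) + weights[feat] * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Made-up similarity scores of two images for one query, per feature.
runs = {"SC": {"img1": 0.9, "img2": 0.4},
        "CL": {"img1": 0.2, "img2": 0.8},
        "CSH": {"img2": 0.5}}
ranked = fuse(runs, weights)
```

        </p>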
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Textual Retrieval</title>
      <p>
        This year's collection contains a subset of PubMed comprising 305000 images. A
detailed description of the collection is given in the overview paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Since the
records have the same structure as in previous CLEF collections,
we followed the same steps in the textual retrieval task as in CLEF 2011.
Each record is identified by a unique figureID, which is associated with the title
of the corresponding article, the article URL, the caption of the figure, the pmid
and the figure URL. Using the pmid we downloaded the MeSH terms assigned
to each article.
      </p>
      <p>Our retrieval system was based on the Lucene 1 search engine. For
indexing we removed stop-words and applied Porter's stemmer. For multi-field
retrieval the weights of the fields were assigned at indexing time. We kept the
same structure of the database as in CLEF 2009, 2010 and 2011. This year we
used only the default scoring function 2, which performed best at the 2011
CLEF ad-hoc retrieval.</p>
      <p>
        We also used the same field weights as in the last three years [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
We used two sets of weights: one estimated empirically on the CLEF
2009 collection, and a second set estimated from the Mean Average Precision
values obtained on the CLEF 2010 collection.
1 http://lucene.apache.org/
2 http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/
apache/lucene/search/Similarity.html
      </p>
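      <p>The effect of indexing-time field weights can be illustrated with a simple per-field score combination; the field names and boost values below are hypothetical, not the weights actually used in our runs:</p>
      <p>

```python
# Hypothetical indexing-time boosts for the record fields.
field_weights = {"title": 2.0, "caption": 3.0, "mesh": 1.0}

def weighted_score(field_scores, field_weights):
    """Combine per-field similarity scores, scaling each by its field boost."""
    return sum(field_weights[f] * s for f, s in field_scores.items())

# Made-up per-field similarities of one document for one query.
doc = {"title": 0.3, "caption": 0.6, "mesh": 0.1}
score = weighted_score(doc, field_weights)   # 2.0*0.3 + 3.0*0.6 + 1.0*0.1
```

      </p>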
    </sec>
    <sec id="sec-4">
      <title>Experimental Results</title>
      <sec id="sec-4-1">
        <title>Results from Textual Retrieval</title>
        <p>This year we submitted a total of six runs using different combinations of fields
and corresponding weights. Table 2 gives the definitions of our textual runs
and Table 3 their corresponding results.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Results from Visual Retrieval</title>
        <p>For the visual retrieval task we also submitted a total of six runs, using
different values for the parameter k, which defines the number of eigenvalues
and eigenvectors selected for indexing and retrieval. For all runs, data fusion over
various features was applied, using the weights acquired from runs of each individual
feature on the CLEF 2011 collection. In Table 4 we give the definitions of
our runs and in Table 5 their corresponding results.</p>
        <p>LSA with k=20 on 64 tiles for Scalable Color, Color Layout and Color Selection Histogram.</p>
        <p>R5: IPL_AUEB_DataFusion_LSA_SC_CL_CSH_64seg_50k:
LSA with k=50 on 64 tiles for Scalable Color, Color Layout and Color Selection Histogram.</p>
        <p>R6: IPL_AUEB_DataFusion_LSA_SC_CL_CSH_64seg_100k:
LSA with k=100 on 64 tiles for Scalable Color, Color Layout and Color Selection Histogram.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Further work</title>
      <p>We have presented a new approach to LSA for CBIR, replacing the SVD analysis
of the feature matrix C (m x n) by the solution of the eigenproblem for the
matrix CC^T (m x m). The method overcomes the high cost of the SVD in terms
of memory and computing time. In addition, in all the experiments, of which
only a small part was submitted officially this year, the optimal value of the
approximation parameter k was less than 50, which makes the method attractive
for fusion over several low-level features. Our approach is promising
and has opened new research directions that need further investigation. The
image representation has an impact on LSA performance, and more systematic
research in that direction is currently in progress. Moreover, the eigenvalues of the
matrix CC^T follow a Zipfian distribution, with the k largest values well
separated, giving residual vectors that are small to machine accuracy; this
provides evidence of the stability of the calculated eigenvectors. More work is
currently underway to determine the stability of the proposed method.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Deerwester</surname>
            ,
            <given-names>S.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furnas</surname>
            ,
            <given-names>G.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harshman</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>JASIS</source>
          <volume>41</volume>
          (
          <issue>6</issue>
          ) (
          <year>1990</year>
          )
          <fpage>391</fpage>
          -
          <lpage>407</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chatzichristofis</surname>
          </string-name>
          , S.A.:
          <article-title>Lire: lucene image retrieval: an extensible java cbir library</article-title>
          . In
          <string-name>
            <surname>El-Saddik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vuong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Griwodz</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bimbo</surname>
            ,
            <given-names>A.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Candan</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaimes</surname>
          </string-name>
          , A., eds.: ACM Multimedia, ACM (
          <year>2008</year>
          )
          <fpage>1085</fpage>
          -
          <lpage>1088</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zeng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Assigning appropriate weights for the linear combination data fusion method in information retrieval</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>45</volume>
          (
          <year>July 2009</year>
          )
          <fpage>413</fpage>
          -
          <lpage>426</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kalpathy-Cramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Bedrick</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggel</surname>
            , I., de Herrera,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsikrika</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Overview of the clef 2011 medical image classification and retrieval tasks</article-title>
          . In: CLEF (Notebook Papers/Labs/Workshop). (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Muller, H.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalpathy-Cramer</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fushman</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antani</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eggel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Overview of the imageclef 2012 medical image retrieval and classification tasks</article-title>
          .
          <source>In: CLEF 2012 working notes</source>
          , Rome, Italy (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gkoufas</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalamboukis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Ipl at imageclef 2011</article-title>
          . In: CLEF (Notebook Papers/LABs/Workshops). (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gkoufas</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalamboukis</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Combining textual and visual information for image retrieval in the medical domain</article-title>
          .
          <source>The Open Medical Informatics Journal</source>
          <volume>5</volume>
          (
          <year>2011</year>
          )
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>