<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zsombor Paróczi</string-name>
          <email>paroczi@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Máté Kis-Király</string-name>
          <email>kis.kiraly.mate@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bálint Fodor</string-name>
          <email>balint.fodor@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Budapest University of Technology and Economics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper we present our contribution to the MediaEval 2015 Retrieving Diverse Social Images Task, which asked participants to provide methods for refining Flickr image retrieval results in order to increase their relevance and diversification. Our approach is based on re-ranking the original result using a precomputed distance matrix and a spectral clustering scheme. We use color-related visual features, text and credibility descriptors to define similarity between images.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>When a potential tourist makes an image search for a
place, she expects to get a diverse and relevant visual result
as a summary of the different views of the location.</p>
      <p>
        In the official challenge (Retrieving Diverse Social Images
at MediaEval 2015) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] a ranked list of location photos
retrieved from Flickr is given, and the task is to refine the
result by providing a set of images that are both relevant and
form a diversified summary. An extended explanation
of the task objectives, the provided dataset and the evaluation
descriptors can be found in the task description paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Diversity means that the images can illustrate different
views of the location at different times of the day/year and
under different weather conditions, creative views, etc. The
utility of the refinement process can be measured using
the precision and diversity metrics [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        Our team participated in previous challenges [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ]; each
year we experimented with a different approach. In 2013
we diversified the initial results using clustering, but
our solution focused on diversification only [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In 2014,
as a new idea, we tried to treat relevance and diversity with the same
importance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>In our previous approaches to the task we treated the
feature vectors (values calculated from the metrics) as an
N-dimensional continuous space with Euclidean coordinates.
In this year's approach we define a set of hand-crafted
distance matrices with non-Euclidean distances, which
can be used during the clustering.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RUNS</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Run1: Visual based re-ranking</title>
      <p>In the first run participants could use only visual based
descriptors or their own descriptors calculated using only the
images.</p>
      <p>For the first run we use the following approach: step 1 -
calculating the FACE descriptor for each image; step 2 - filtering
the images using the FACE and CN[0] descriptors; step 3 -
creating a distance matrix from color similarity; step 4 -
performing spectral clustering using the distance matrix; step 5 -
creating the new result list using the cluster information.</p>
      <p>
        Our main approach was using color based distances [
        <xref ref-type="bibr" rid="ref1 ref5">1,
5</xref>
        ] and filtering photos with faces [
        <xref ref-type="bibr" rid="ref6 ref7">7, 6</xref>
        ]. We used two of
the descriptors provided by the organizers [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]: Global Color
Moments on HSV Color Space (CM), which represents the first
three central moments of an image's color distribution (mean,
standard deviation and skewness), and Global Color Naming
Histogram (CN), which maps colors to 11 universal color names:
"black", "blue", "brown", "grey", "green", "orange", "pink",
"purple", "red", "white", and "yellow".
      </p>
      <p>
        First we calculated a new descriptor for each image: the
FACE descriptor is the ratio between the area occupied
by the possible face regions of an image and the whole image
area [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Then we used the CN descriptor to filter out predominantly black
images, since dark images tend to have
fewer colors, and their colors are mostly shifted into the gray range
rather than being bright.
      </p>
      <p>In the reordering step we started from the original result.
We did our initial filtering by moving images to the end of
the result list where FACE &gt; 0 or CN[0] &gt; 0.8; the first
value in CN corresponds to the color black.</p>
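      <p>As an illustration only (a sketch, not the authors' exact implementation), the following Python fragment performs this demotion step. It assumes each image record already carries a precomputed FACE ratio and its CN histogram; the dict-based record layout and field names are hypothetical.</p>
      <preformat><![CDATA[
# Hedged sketch of the Run1 pre-filtering step (field names are hypothetical).
# Each item is assumed to be a dict carrying the precomputed descriptors:
#   "face": FACE ratio (face area / image area), "cn": 11-bin color naming histogram.
def demote_faces_and_dark_images(ranked_items, face_thr=0.0, black_thr=0.8):
    """Keep the original order, but move images with detected faces or a
    dominant 'black' bin (CN[0]) to the end of the result list."""
    kept, demoted = [], []
    for item in ranked_items:
        if item["face"] > face_thr or item["cn"][0] > black_thr:
            demoted.append(item)
        else:
            kept.append(item)
    return kept + demoted
]]></preformat>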
      <p>After the preprocessing step we built the distance matrix
F; between any two images A and B the distance was calculated
using the following equation:</p>
      <disp-formula>
        <tex-math><![CDATA[
F_{A,B} = \sum_{i=0}^{10} \bigl| CN_A[i] - CN_B[i] \bigr|
        + \sum_{i=0}^{8} \bigl| s_i \left( CM_A[i] - CM_B[i] \right) \bigr|,
\qquad
s_i =
\begin{cases}
1.5, & \text{where } 0 \le i < 3 \\
5,   & \text{where } 3 \le i < 5 \\
0.5, & \text{where } 5 \le i < 9
\end{cases}
]]></tex-math>
      </disp-formula>
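      <p>Assuming the provided CN descriptors form an N x 11 array and the CM descriptors an N x 9 array, the matrix F could be computed for example as in the following NumPy sketch (the vectorized form is our own illustration, not the authors' code):</p>
      <preformat><![CDATA[
import numpy as np

# s_i weights for the 9 color-moment components, as defined above.
S = np.array([1.5] * 3 + [5.0] * 2 + [0.5] * 4)

def visual_distance_matrix(CN, CM):
    """F[a, b] = sum_i |CN_a[i] - CN_b[i]| + sum_i |s_i * (CM_a[i] - CM_b[i])|."""
    cn_term = np.abs(CN[:, None, :] - CN[None, :, :]).sum(axis=2)
    cm_term = np.abs(S * (CM[:, None, :] - CM[None, :, :])).sum(axis=2)
    return cn_term + cm_term
]]></preformat>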
      <p>
        After the distance matrix was created, we used
unsupervised spectral clustering [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ] to create clusters from
the first 150 images; the target cluster count was 10.
      </p>
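      <p>The paper does not state how the distance matrix is converted into the affinity matrix required by spectral clustering; a common choice is a Gaussian kernel. The following scikit-learn sketch makes that assumption explicit:</p>
      <preformat><![CDATA[
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_top_images(distance_matrix, n_images=150, n_clusters=10):
    """Spectral clustering on a precomputed affinity derived from the distance
    matrix; the Gaussian conversion is an assumption on our part."""
    D = distance_matrix[:n_images, :n_images]
    sigma = D.std() or 1.0                       # guard against a zero spread
    affinity = np.exp(-(D ** 2) / (2 * sigma ** 2))
    model = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                               random_state=0)
    return model.fit_predict(affinity)           # one cluster label per image
]]></preformat>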
      <p>The final result was generated by picking the lowest
ranking item from each cluster, appending those to the result
list, then repeating this until all the items are used. The
same clustering and sorting method was used during run2
and run3.</p>
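      <p>A minimal sketch of this round-robin selection, assuming items holds the filtered images in their current rank order and labels the cluster assignment returned by the clustering step:</p>
      <preformat><![CDATA[
from collections import OrderedDict, deque

def diversify_by_clusters(items, labels):
    """Repeatedly take the best-ranked remaining item from each cluster and
    append it to the new result list until every item has been used."""
    buckets = OrderedDict()
    for item, label in zip(items, labels):       # items are in rank order
        buckets.setdefault(label, deque()).append(item)
    result = []
    while buckets:
        for label in list(buckets):
            result.append(buckets[label].popleft())
            if not buckets[label]:               # cluster exhausted
                del buckets[label]
    return result
]]></preformat>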
    </sec>
    <sec id="sec-4">
      <title>2.2 Run2: Text based re-ranking</title>
      <p>The second run was the text based re-ranking, which is
accomplished using the title, tags and description fields of
each image.</p>
      <p>For the second run we use the following approach: step
1 - filtering stop words and special characters; step 2 - creating a
distance matrix from text similarity; step 3 - performing spectral
clustering using the distance matrix; step 4 - using the
cluster information to create the new result list.</p>
      <p>
        As a preprocessing step we executed stop word
filtering. We also removed some special characters (namely:
.,-:;0123456789() @) and HTML specific character sequences
(&amp;amp;, &amp;quot; and everything between &lt; and &gt;), then we
used the remaining text as the input for a simple TF-IDF
calculation [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
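      <p>A rough sketch of this cleaning step and of the document-frequency statistics used below is given here; the stop-word list and the exact regular expressions are placeholders rather than the authors' actual choices:</p>
      <preformat><![CDATA[
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "in", "at", "on"}   # placeholder list

def preprocess(text):
    """Strip HTML entities/tags, punctuation and digits, then drop stop words."""
    text = re.sub(r"&\w+;|<[^>]*>", " ", text)        # &amp;, &quot;, <...> fragments
    text = re.sub(r"[.,\-:;()@0-9]", " ", text)       # special characters and digits
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def document_frequencies(documents):
    """DF_t: the number of documents that contain term t at least once."""
    df = Counter()
    for doc in documents:
        df.update(set(preprocess(doc)))
    return df
]]></preformat>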
      <p>We calculated the distance between images (i.e. their text
fields) A and B in the following manner. We
initialize the distance GA,B to zero and compare A and B
at the term level: every term t occurring in document A
is compared with the terms of document B and vice versa.
If term t is contained in both documents, then GA,B is
not increased. If t is contained in only one document, we
take the document frequency (DFt) into consideration: if
DFt &lt; 5, then it is a rare term and GA,B is increased
by 2; if DFt &gt; DN/4, then it is a common term and GA,B
is increased by 0.1 (where DN is the total number
of documents). If the term is neither common nor rare, then we
add DFt/DN to the distance.</p>
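      <p>Expressed as code, this rule set could look as follows (a sketch that reuses the preprocess and document_frequencies helpers above; the thresholds are the ones stated in the text):</p>
      <preformat><![CDATA[
def text_distance(terms_a, terms_b, df, dn):
    """G_{A,B} for one text field: terms_a/terms_b are preprocessed term lists,
    df the document-frequency counter, dn the total number of documents."""
    a, b = set(terms_a), set(terms_b)
    distance = 0.0
    for t in a.symmetric_difference(b):   # terms contained in only one document
        if df[t] < 5:                     # rare term
            distance += 2.0
        elif df[t] > dn / 4:              # common term
            distance += 0.1
        else:                             # neither rare nor common
            distance += df[t] / dn
    return distance
]]></preformat>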
      <p>Using the three text descriptors we created a weighted
sum of the field distances, where the empirically determined
weights are as follows: title = 1, tags = 2, description = 0.5.
From these GA,B values we created the G distance matrix.</p>
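      <p>The per-field distances can then be combined with the weights above, e.g. as in this sketch built on the text_distance helper (the dict-based image representation, with each field holding a preprocessed term list, is assumed):</p>
      <preformat><![CDATA[
FIELD_WEIGHTS = {"title": 1.0, "tags": 2.0, "description": 0.5}

def combined_text_distance(image_a, image_b, df, dn):
    """Weighted sum of the distances of the three text fields of two images;
    image_a[field] / image_b[field] are assumed to be preprocessed term lists."""
    return sum(weight * text_distance(image_a[field], image_b[field], df, dn)
               for field, weight in FIELD_WEIGHTS.items())
]]></preformat>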
    </sec>
    <sec id="sec-5">
      <title>2.3 Run3: Multimodal re-ranking</title>
      <p>In the third run both visual and textual descriptors could
be used to create the results.</p>
      <p>For the third run we use the following approach: step 1
- creating the distance matrix F (see Section 2.1); step 2
- creating the distance matrix G (see Section 2.2); step 3 -
creating a new distance matrix by combining F and G;
step 4 - performing spectral clustering using the distance matrix;
step 5 - using the cluster information to create the new result
list.</p>
      <p>We used our visual distance matrix F and text distance
matrix G and created a new aggregate matrix H. This
matrix is simply the sum of the corresponding values from
the F and G matrices. We tried different kinds of weighting
methods, but the plain, unweighted matrices supplied the best results on
the development set.</p>
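      <p>In code the aggregation is a plain element-wise sum, for example as below (a sketch; the clustering and re-ranking steps are then applied to H exactly as in run1):</p>
      <preformat><![CDATA[
import numpy as np

def multimodal_distance_matrix(F, G):
    """Run3 aggregate distance: element-wise sum of the visual (F) and
    text (G) distance matrices, without any additional weighting."""
    return np.asarray(F) + np.asarray(G)
]]></preformat>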
    </sec>
    <sec id="sec-6">
      <title>2.4 Run4: Credibility based re-ranking</title>
      <p>
        In the fourth run participants were provided with
credibility descriptors [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Using the original result we filtered out the images of users
whose faceProportion value was more than 1.3 to create the same
effect as we did with the FACE descriptor.</p>
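      <p>A minimal sketch of this user-level filter is shown below (the descriptor access path is hypothetical; the locationSimilarity rule described next follows the same pattern with its own threshold):</p>
      <preformat><![CDATA[
def filter_by_face_proportion(items, threshold=1.3):
    """Drop images whose uploader's faceProportion credibility descriptor
    exceeds the threshold; the remaining order is kept unchanged."""
    return [item for item in items
            if item["user_credibility"]["faceProportion"] <= threshold]
]]></preformat>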
      <p>With the purpose of increasing the diversity we used
the locationSimilarity descriptor: if this value exceeded the
threshold of 3.0, we excluded the image. Despite our simple
approach we had great results on the development set.
</p>
    </sec>
    <sec id="sec-results">
      <title>3. RESULTS</title>
      <p>[Figure 1: F1 scores of the runs (Visual, Text, Vistext, Cred), shown separately for single-concept and multi-concept queries.]</p>
      <p>The 2015 dataset contained 153 location queries (45,375
Flickr photos) as the development set; we used this set to
develop our approach, and all methods and thresholds were
tuned using the whole development set.</p>
      <p>The test set contained 139 queries: 69 one-concept
location queries (20,700 Flickr photos) and 70 multi-concept
queries related to events and states associated with locations
(20,694 Flickr photos). Single-topic queries are basic
formulations such as the name of a location; multi-concept
queries are more complex, being related to events and
states associated with locations (like 'sunset in the city').</p>
      <p>Our results can be seen in Table 3 and the F1 metrics can
be seen in Figure 1; we list the single and multi-concept
based results separately.</p>
    </sec>
    <sec id="sec-7">
      <title>4. CONCLUSION AND FUTURE WORK</title>
      <p>As one can see, the visual information based results are
the best among all the runs. On the development set we
found that the textual information for many images
is missing or does not describe the content very well. It
is not uncommon for an author to give the same textual
information to all of the images in a topic.</p>
      <p>The credibility based descriptors proved to be much
more useful than we initially thought; in the future we
should focus on them to improve the textual and visual
descriptor based results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Datta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and J. Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          .
          <article-title>Image retrieval: Ideas, influences, and trends of the new age</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>40</volume>
          (
          <issue>2</issue>
          ):
          <fpage>5:1</fpage>
          -
          <lpage>5:60</lpage>
          , May
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Ginsca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at mediaeval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2015 Workshop</source>
          , Wurzen, Germany, September 14-15, CEUR-WS.org,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiao</surname>
          </string-name>
          .
          <article-title>Spectral clustering ensemble for image segmentation</article-title>
          .
          <source>In Proceedings of the First ACM/SIGEVO Summit on Genetic and Evolutionary Computation</source>
          ,
          <source>GEC '09</source>
          , pages
          <fpage>415</fpage>
          -
          <lpage>420</lpage>
          , New York, NY, USA,
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>On spectral clustering: Analysis and an algorithm</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>849</fpage>
          -
          <lpage>856</lpage>
          . MIT Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Paramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          .
          <article-title>Diversity in photo retrieval: Overview of the imageclefphoto task 2009</article-title>
          .
          <source>In Proceedings of the 10th International Conference on Cross-language Evaluation Forum: Multimedia Experiments, CLEF'09</source>
          , pages
          <fpage>45</fpage>
          -
          <lpage>59</lpage>
          , Berlin, Heidelberg,
          <year>2010</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Paroczi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fodor</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Szucs</surname>
          </string-name>
          .
          <article-title>Dclab at mediaeval2014 search and hyperlinking task</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2014 Workshop</source>
          , Barcelona, Spain, October 16-17, CEUR-WS.org,
          <source>ISSN 1613-0073</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Szűcs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Paroczi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Vincz</surname>
          </string-name>
          .
          <article-title>Bmemtm at mediaeval 2013 retrieving diverse social images task: Analysis of text and visual information</article-title>
          .
          <source>In Working Notes Proceedings of the MediaEval 2013 Workshop</source>
          , Barcelona, Spain, October 18-19, CEUR-WS.org,
          <source>ISSN 1613-0073</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Taneva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kacimi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Gathering and ranking photos of named entities with high precision, high recall, and diversity</article-title>
          .
          <source>In Proceedings of the Third ACM International Conference on Web Search and Data Mining, WSDM '10</source>
          , pages
          <fpage>431</fpage>
          -
          <lpage>440</lpage>
          , New York, NY, USA,
          <year>2010</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Yeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-H.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Video news retrieval incorporating relevant terms based on distribution of document frequency</article-title>
          .
          <source>In Proceedings of the 9th Pacific Rim Conference on Multimedia: Advances in Multimedia Information Processing, PCM '08</source>
          , pages
          <fpage>583</fpage>
          -
          <lpage>592</lpage>
          , Berlin, Heidelberg,
          <year>2008</year>
          . Springer-Verlag.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>