<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>OHSU @ MediaEval 2015: Adapting Textual Techniques to Multimedia Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shiran Dudy</string-name>
          <email>dudy@ohsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven Bedrick</string-name>
          <email>bedricks@ohsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Spoken Language Understanding, OHSU</institution>
          ,
          <addr-line>Portland, Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>In this paper, we present the motivation, methods, results, and analysis of our participation in the 2015 MediaEval Retrieving Diverse Social Images Task. This year, we adapted a recently-published technique for result diversification ("Relational Learning-to-Rank" [13]), borrowed from the world of standard document retrieval. As compared to the original work, our version makes certain changes to the ranking and comparison algorithm, and explores a variety of feature combinations specific to an image retrieval context. The key idea behind our technique is a greedy, iterative approach to ranking search results, which attempts to balance relevance with redundancy by comparing candidate results to those already selected by the algorithm. Our approach worked tolerably well on many queries, but there is clearly room for improvement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Imagine you are in Munich, and it's the time of year when
everybody is talking about Oktoberfest. Being unfamiliar
with this festival, you perform an image search, to try and
find out whether you'd like the event, and to discover what
to expect should you attend. Unfortunately, your results
consist of two hundred very similar images, all of the inside
of beer tents. While certainly relevant, these results only
show a small slice of what Oktoberfest is about: where are
the parades, the concerts, the fairgrounds? A more diverse
set of search results would have been much more useful in
this situation.</p>
      <p>
        The Retrieving Diverse Social Images task at the 2015
MediaEval workshop required participants to provide the most
diverse and relevant images given a search query like
\Oktoberfest." The organizers provided a detailed task
description along with data set for development and evaluation,
described fully in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Our team chose to adapt a recent
technique for search result diversification [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that adapts
\traditional" learning-to-rank methods [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to incorporate
diversity into its loss function.
      </p>
      <p>
        Search result diversification is a very active area of research
in information retrieval. The general problem of identifying
an optimal ranking that balances both relevance and diversity
has been shown to be NP-complete [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which
means that most techniques rely on approximations of one
kind or another. One common family of approximations
descends from the greedy, iterative Maximal Marginal
Relevance (MMR) approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], in which each successive document
is chosen based on its similarity to the user's query and its
dissimilarity to the set of already-chosen documents.
      </p>
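      <p>To make the greedy MMR-style selection concrete, here is a
minimal Python sketch of the approach described in [3]. The
relevance and similarity scoring functions, the trade-off weight
lam, and the cutoff k are hypothetical stand-ins, not part of the
original work.</p>
      <preformat>
# Minimal sketch of greedy MMR-style selection: each pick balances
# relevance against redundancy with the already-chosen set.
def mmr_select(candidates, relevance, similarity, lam=0.7, k=20):
    selected, remaining = [], list(candidates)
    for _ in range(min(k, len(remaining))):
        def score(d):
            # redundancy = similarity to the closest already-chosen doc
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * relevance(d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
</preformat>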
      <p>
        Another family of approaches directly models attributes of
the query and of the documents, and then identifies subsets
of results that are representative of different combinations
of attributes. For example, Agrawal et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] use a
taxonomy to model documents and queries, and identify a set of
results that thoroughly covers the taxonomy entries represented
by the retrieved documents.
      </p>
      <p>
        The approach our group used in this year's MediaEval task
fuses elements of both families of techniques. It is based
on a paper by Zhu et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] that describes an extension of
Learning-to-Rank (LtR). Traditional LtR consists of
learning a ranking function that attempts to assign a rank to a
particular document given a particular query. Zhu et al.'s
extension, "Relational Learning-to-Rank" (R-LtR), models
result ranking as a sequential selection process, and their
formulation incorporates knowledge about not only the
document in question and the query, but also the set of
documents that have already been selected.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3 Methodology</title>
      <p>
        For a complete description of R-LtR, we refer the reader to
the original paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. In brief, R-LtR is an iterative
scoring method that takes into account both the relevance of a
textual document and information about how similar it is
to documents that have already been chosen. The
algorithm represents documents as arbitrary feature vectors.
Each successive document is scored against the documents
that have already been chosen according to the following
scoring function (equation 2 in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]):
      </p>
      <p>f_s(x_i, R_i) = w_r^T x_i + w_d^T h_S(R_i),   ∀ x_i ∈ X \ S   (1)</p>
      <p>This scoring function combines information on relevance
and diversity given the candidate document x_i (represented
as a k-dimensional feature vector) and its "diversity matrix"
R_i. This matrix is actually a "slice" of a three-way tensor
mapping documents to documents along features; each value
R_{i,j,k} represents the relationship between documents i and j
in terms of feature k. For example, if we were to use the
Jaccard similarity metric as our first feature, R_{i,j,1} would
consist of the Jaccard similarity of documents x_i and x_j.
This formulation allows us to combine entirely arbitrary
features and relational functions.</p>
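      <p>As an illustration of the preceding paragraph, the following
sketch builds one slice R_i from a list of relational functions,
with Jaccard similarity over tag sets as the first feature. The
helper names and inputs are ours, not from [13].</p>
      <preformat>
import numpy as np

def jaccard(a, b):
    # Jaccard similarity between two tag sets
    a, b = set(a), set(b)
    return len(a.intersection(b)) / len(a.union(b)) if (a or b) else 0.0

def build_R_i(candidate_tags, selected_tags_list, relational_fns):
    # R_i[j, k] relates the candidate to already-selected document j
    # under relational function k (Jaccard here is feature k = 0).
    return np.array([[fn(candidate_tags, tags) for fn in relational_fns]
                     for tags in selected_tags_list])
</preformat>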
      <p>Note further that R_i in equation 1 is defined as including
all documents x_j ∈ S, where S is the set of documents that
have already been chosen out of the set of all possible
documents, X. In other words, R_i contains information relating
document x_i to the already-selected documents, while X \ S is
the remaining set of not-yet-selected documents. h_S(R_i) is a
relational function comparing document x_i to the entire set of
documents in S; Zhu et al. propose several different methods of
combining the data stored in R_i: taking the minimal distance
(i.e., for each feature k, taking the minimum over x_j ∈ S of
R_{i,j,k}), averaging, or taking the maximum distance.</p>
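      <p>Under these definitions, equation 1 reduces to a few lines of
code. The sketch below assumes the minimum-distance aggregation
for h_S; averaging or taking the maximum are drop-in replacements.</p>
      <preformat>
import numpy as np

def f_s(x_i, R_i, w_r, w_d):
    # x_i: k-dim relevance feature vector for the candidate
    # R_i: |S| x m matrix relating the candidate to each chosen doc
    h_S = R_i.min(axis=0)          # collapse relations to one m-vector
    return w_r @ x_i + w_d @ h_S   # relevance term + diversity term
</preformat>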
      <p>
        Finally, w_r and w_d are weight vectors corresponding to
the relative weights of relevance and diversity, respectively.
Equation 1 is used for prediction (i.e., scoring); Zhu et al.
outline a training process that uses stochastic gradient
descent to learn values for w_r and w_d. For reasons of
space, we will not discuss training in this paper, and refer
the reader to the full description in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Note that, unlike
Zhu et al., in this year's task we were given results
already sorted in terms of "relevance" (according to Flickr's
search engine). As such, we were able to simplify the
algorithm described by Zhu et al., using this existing
relevance information instead of computing our own from
scratch.
      </p>
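      <p>Putting these pieces together, prediction becomes a greedy
loop. The sketch below (reusing the f_s sketch above) reflects our
simplification: candidates arrive pre-sorted by Flickr's relevance,
so the top result seeds S and each subsequent step re-scores the
remainder. The feature_vec and relation_slice helpers, which
produce x_i and R_i for a candidate image, are hypothetical.</p>
      <preformat>
def rerank(images, w_r, w_d, feature_vec, relation_slice, k=50):
    selected = [images[0]]          # seed with the most relevant result
    remaining = list(images[1:])
    while remaining and len(selected) != k:
        best = max(remaining,
                   key=lambda img: f_s(feature_vec(img),
                                       relation_slice(img, selected),
                                       w_r, w_d))
        selected.append(best)
        remaining.remove(best)
    return selected
</preformat>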
      <p>
        In order to adapt this algorithm to the image search
domain, we identified combinations of features and
appropriate distance metrics based on the shared task data. We
represented "textual" information by transforming each
image's "tags" and "description" fields into a tf-idf-weighted
bag-of-words representation, which we then processed using
Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to reduce its
dimensionality. We also performed Latent Dirichlet Allocation
(LDA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on the tag/description data, in order to attempt
to represent topic groups within the results. We computed
similarity for these features using L2 (Euclidean) distance;
both feature sets were computed using the Gensim package
(http://radimrehurek.com/gensim/).
      </p>
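      <p>A condensed sketch of this textual pipeline, using standard
Gensim APIs; the loader and the topic counts shown are
illustrative assumptions, not the values we tuned.</p>
      <preformat>
from gensim import corpora, models

raw_docs = load_text_per_image()   # hypothetical loader: tags + description per image
texts = [doc.lower().split() for doc in raw_docs]
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]

# tf-idf weighting, then LSA to reduce dimensionality
tfidf = models.TfidfModel(bows)
lsa = models.LsiModel(tfidf[bows], id2word=dictionary, num_topics=100)
lsa_vecs = [lsa[tfidf[b]] for b in bows]

# LDA over the raw bag-of-words counts, to capture topic groups
lda = models.LdaModel(bows, id2word=dictionary, num_topics=20)
lda_vecs = [lda[b] for b in bows]
</preformat>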
      <p>
        In addition to the textual features, we utilized several of
the visual features provided by the shared task. Along with
their distance metrics, we used \csd" (L2) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], \hog"
(Bhattacharyya distance) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], \cn" (L2) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], \cm" (Canberra
distance) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], \lbp" ( 2) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and \glr" (L1 Manhattan) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
All features were normalized such that larger values for the
distance functions represented higher degrees of diversity
(for the values in R). For our run including user credibility
data, we included "visualScore", "faceProportion",
"tagSpecificity", "uniqueTags", and "locationSimilarity".
      </p>
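      <p>For reference, minimal implementations of the distance
functions named above, as we understood them; descriptors are
assumed to be nonnegative numpy arrays and, where appropriate,
normalized to sum to one.</p>
      <preformat>
import numpy as np

def l2(a, b):        # Euclidean (csd, cn, LSA/LDA features)
    return np.linalg.norm(a - b)

def l1(a, b):        # Manhattan (glr)
    return np.abs(a - b).sum()

def canberra(a, b):  # cm
    return (np.abs(a - b) / (np.abs(a) + np.abs(b) + 1e-12)).sum()

def chi_square(a, b):  # lbp
    return 0.5 * (((a - b) ** 2) / (a + b + 1e-12)).sum()

def bhattacharyya(a, b):  # hog; a, b treated as discrete distributions
    return -np.log(np.sqrt(a * b).sum() + 1e-12)
</preformat>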
    </sec>
    <sec id="sec-3">
      <title>4 Submitted Runs</title>
      <p>We trained four different models. The first, run 1, used only
image (visual) features. Run 2 used the textual features
described above (LSA and LDA on descriptions and tags). Runs 4
and 5 combined both image and textual features with
user credibility information. The textual features remained
the same across runs; runs 4 and 5 experimented with using
global image features (i.e., calculated on the entire image)
versus features computed locally on image quadrants.</p>
    </sec>
    <sec id="sec-4">
      <title>5 Results &amp; Discussion</title>
      <p>Our results are summarized in Table 1. Our visual-feature-only
run (run 1) outperformed our text-feature-only run (run 2) in
terms of Cluster Recall @ 20 but, interestingly, not in terms of
Precision @ 20. Incorporating textual and user information
(run 4) did not seem to substantially alter our results.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Cluster Recall @ 20 ("all" queries) for our submitted runs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Run</th><th>CR@20</th></tr>
          </thead>
          <tbody>
            <tr><td>run 1</td><td>0.46</td></tr>
            <tr><td>run 2</td><td>0.42</td></tr>
            <tr><td>run 4</td><td>0.46</td></tr>
            <tr><td>run 5</td><td>0.41</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Our adaptation of R-LtR to an image retrieval task shows
that this approach to result diversification can work with a
wide variety of features and distance metrics. Our results
are promising, though clearly much work remains to be done
in terms of feature engineering and parameter tuning. We
also hope to extend the algorithm to include more adaptable
feature weight vectors, enabling the system to give different
weight to certain feature subsets (e.g., textual or visual)
depending on query or image characteristics. R-LtR is a
flexible and powerful platform from which to begin such
experiments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gollapudi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halverson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ieong</surname>
          </string-name>
          .
          <article-title>Diversifying search results</article-title>
          .
          <source>In WSDM '09: Proceedings of the Second ACM International Conference on Web Search and Data Mining</source>
          , pages
          <fpage>5</fpage>
          –
          <lpage>14</lpage>
          , New York, New York, USA, Feb.
          <year>2009</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>The Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          :
          <fpage>993</fpage>
          –
          <lpage>1022</lpage>
          , Mar.
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carbonell</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          .
          <article-title>The use of MMR, diversity-based reranking for reordering documents and producing summaries</article-title>
          .
          <source>In SIGIR '98: Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval</source>
          , pages
          <fpage>335</fpage>
          –
          <lpage>336</lpage>
          , New York, New York, USA, Aug.
          <year>1998</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. C.</given-names>
            <surname>Deerwester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Harshman</surname>
          </string-name>
          .
          <article-title>Indexing by latent semantic analysis</article-title>
          .
          <source>JASIS</source>
          ,
          <volume>41</volume>
          (
          <issue>6</issue>
          ):
          <fpage>391</fpage>
          –
          <lpage>407</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.-C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. P.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. W.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Yeung</surname>
          </string-name>
          .
          <article-title>Content-based image retrieval using color moment and Gabor texture feature</article-title>
          .
          <source>In 2010 International Conference on Machine Learning and Cybernetics (ICMLC)</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>719</fpage>
          –
          <lpage>724</lpage>
          . IEEE,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Gînscă</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Boteanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Retrieving diverse social images at MediaEval 2015: Challenge, dataset and evaluation</article-title>
          .
          <source>In MediaEval 2015 Workshop</source>
          , Wurzen, Germany,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y. H.</given-names>
            <surname>Ha</surname>
          </string-name>
          .
          <article-title>Spatial color descriptor for image retrieval and video segmentation</article-title>
          .
          <source>IEEE Transactions on Multimedia</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>358</fpage>
          –
          <lpage>367</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Learning to rank for information retrieval</article-title>
          .
          <source>Springer</source>
          , New York, 1st edition,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Selvarajah</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Kodituwakku</surname>
          </string-name>
          .
          <article-title>Analysis and comparison of texture features for content based image retrieval</article-title>
          .
          <source>International Journal of Latest Trends in Computing</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sikora</surname>
          </string-name>
          .
          <article-title>The MPEG-7 visual standard for content description - an overview</article-title>
          .
          <source>IEEE Transactions on Circuits and Systems for Video Technology</source>
          ,
          <volume>11</volume>
          (
          <issue>6</issue>
          ):
          <fpage>696</fpage>
          –
          <lpage>702</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yin</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Collins</surname>
          </string-name>
          .
          <article-title>Object tracking and detection after occlusion via numerical hybrid local and global mode-seeking</article-title>
          .
          <source>In 2008 IEEE Conference on Computer Vision and Pattern Recognition</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          , June
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Boosting local binary pattern (LBP)-based face recognition</article-title>
          .
          <source>In Advances in biometric person authentication</source>
          , pages
          <fpage>179</fpage>
          –
          <lpage>186</lpage>
          . Springer,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Niu</surname>
          </string-name>
          .
          <article-title>Learning for search result diversification</article-title>
          .
          <source>In Proceedings of the 37th international ACM SIGIR conference on Research &amp; development in information retrieval</source>
          , pages
          <fpage>293</fpage>
          –
          <lpage>302</lpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>