=Paper=
{{Paper
|id=None
|storemode=property
|title=UEC, Tokyo at MediaEval 2013 Retrieving Diverse Social Images Task
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_30.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/YanaiN13
}}
==UEC, Tokyo at MediaEval 2013 Retrieving Diverse Social Images Task==
Keiji Yanai and Do Hang Nga
The University of Electro-Communications, Tokyo
1-5-1 Chofugaoka, Chofu-shi, Tokyo 182-8585, JAPAN
yanai@cs.uec.ac.jp, dohang@mm.cs.uec.ac.jp

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.

ABSTRACT

In this paper, we describe our method and results for the MediaEval 2013 Retrieving Diverse Social Images Task. To accomplish the task objective, we adopt VisualRank [5] and Ranking with Sink Points [2], which are common methods for selecting representative and diverse photos. To obtain an affinity matrix for both ranking methods, we used only the officially provided features, namely visual features and tag features. We submitted the three required runs: a visual-only run, a textual-only run and a visual-textual fused run.

1. INTRODUCTION

In this paper, we describe our method and results for the MediaEval 2013 Retrieving Diverse Social Images Task [4]. The objective of this task is to select relevant and diverse photos from a given photo set for a specific location. To do that, we adopt VisualRank [5] and Ranking with Sink Points [2]. We adopted these methods because we had previously used them for ranking geo-tagged photos [6]. First, we compute a similarity matrix from the given features and apply VisualRank to select the most representative photo. Then we re-rank the remaining photos with Ranking with Sink Points after removing the first-ranked photo. We repeat re-ranking with Ranking with Sink Points and removing the first-ranked photo until 50 photos have been selected.

To obtain a similarity matrix for both ranking methods, we used only the officially provided features, namely visual features and tag features. We submitted the three required runs (visual-only, textual-only and visual-textual fused), which are the minimum requirement to participate in this task.

2. RANKING METHOD

To obtain representative and diverse photos in the upper ranks, we adopt VisualRank [5] and Ranking with Sink Points [2]. In this section, we briefly explain both methods and the features.
2.1 VisualRank

VisualRank is an image ranking method based on PageRank [1]. PageRank ranks Web pages using the hyper-link structure of the Web; the rank values are estimated as the steady-state distribution of a random-walk Markov-chain probabilistic model. VisualRank uses a similarity matrix of images instead of the hyper-link structure. Eq. (1) is the update equation of VisualRank:

    r_{i+1} = α S r_i + (1 − α) p,  (0 ≤ α ≤ 1)   (1)

where S is the column-normalized similarity matrix of the images, p is a damping vector, r is the ranking vector, each element of which represents the ranking score of one image, and α controls the extent of the effect of p. The final value of r is estimated by updating r iteratively with Eq. (1). Because S is column-normalized and the elements of p sum to 1, the sum of the ranking vector r does not change. Although p is set to a uniform vector in VisualRank, as in normal PageRank, it is known that p can serve as a bias vector which affects the final value of r [3].
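As a sketch, the iteration in Eq. (1) takes only a few lines of NumPy. This is an illustrative implementation, not the authors' code; the names `visual_rank` and `column_normalize` are our own:

```python
import numpy as np

def column_normalize(W):
    """Make each column of a similarity matrix sum to 1 (the S of Eq. 1)."""
    return W / W.sum(axis=0, keepdims=True)

def visual_rank(S, alpha=0.85, n_iter=100, tol=1e-9):
    """Iterate r_{i+1} = alpha * S @ r_i + (1 - alpha) * p (Eq. 1).

    S must be column-normalized; p is the uniform damping vector,
    as in the paper. Returns the converged ranking vector r.
    """
    n = S.shape[0]
    p = np.full(n, 1.0 / n)   # uniform damping vector
    r = np.full(n, 1.0 / n)   # initial ranking vector
    for _ in range(n_iter):
        r_next = alpha * S @ r + (1 - alpha) * p
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r
```

Because S is column-stochastic and p sums to 1, the sum of r stays 1 throughout the iteration, matching the invariance noted above.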
2.2 Ranking with Sink Points

Because VisualRank considers only the representativeness of items, the higher ranks are sometimes occupied by items which are similar to each other. That is, VisualRank cannot produce a ranking that accounts for the diversity of items. Therefore, we adopt Ranking with Sink Points [2], which can be regarded as an extension of PageRank [1] that makes the obtained ranking both relevant and diverse.

To address diversity in ranking, the concept of sink points is useful. Sink points are data objects whose ranking scores are fixed at zero during the ranking process. Hence, sink points never spread any ranking score to their neighbors. Intuitively, we can imagine sink points as "black holes" on the ranking manifold: ranking scores spreading to them are absorbed, and no ranking score escapes from them.

First, we apply VisualRank with the obtained affinity matrix to select the most representative photo. Then we re-rank the remaining photos with Ranking with Sink Points, as shown in Eq. (2), after removing the first-ranked photo as a sink point. We repeat re-ranking with Ranking with Sink Points and removing the first-ranked photo until 50 photos have been selected:

    r_{i+1} = α S I_i r_i + (1 − α) p   (2)

where I_i is an indicator matrix, a diagonal matrix whose (i, i)-element equals 0 if x_i ∈ X_s and 1 otherwise, and X_s is the set of sink points. Note that α is set to 0.85 in the experiments.
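The greedy selection loop described above can be sketched as follows, again as an illustrative NumPy implementation rather than the authors' code (`rank_with_sink_points` is a hypothetical name). The `indicator` vector plays the role of the diagonal of I_i in Eq. (2):

```python
import numpy as np

def rank_with_sink_points(S, alpha=0.85, n_select=50, n_iter=100):
    """Greedy diverse selection via Eq. (2): run the ranking iteration,
    take the top-scoring photo, turn it into a sink point by zeroing its
    entry in the indicator diagonal, and repeat until n_select photos
    are chosen. S must be column-normalized.
    """
    n = S.shape[0]
    p = np.full(n, 1.0 / n)
    indicator = np.ones(n)        # diagonal of I_i: 0 for sink points
    selected = []
    for _ in range(min(n_select, n)):
        r = np.full(n, 1.0 / n)
        for _ in range(n_iter):
            # indicator * r implements I_i r: sink points spread no score
            r = alpha * S @ (indicator * r) + (1 - alpha) * p
        r[selected] = 0.0         # sink points keep ranking score zero
        best = int(np.argmax(r))
        selected.append(best)
        indicator[best] = 0.0     # absorbed: becomes a sink point
    return selected
```

Once an item becomes a sink point, the score it would have received is absorbed, so near-duplicates of already-selected photos are demoted and the next pick favors a different cluster.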
2.3 Visual Features

We used the ten kinds of visual features officially provided by the task organizers, such as the Global Histogram of Oriented Gradients and Color Moments in HSV Color Space. The details of the official visual features are explained in [4]. We compute similarities for each visual feature with histogram intersection. Finally, we construct an affinity matrix by averaging the similarities over the ten kinds of visual features.

2.4 Textual Features

We use the social TF-IDF weights provided by the task organizers. We extract bag-of-words vectors with social TF-IDF weights from the Flickr metadata of all the given images, and compute an affinity matrix with the cosine similarity between the bag-of-words vectors within each place.

To obtain an affinity matrix for the visual-textual fused runs, we simply averaged the visual-feature-based and the textual-feature-based affinity matrices.
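The affinity-matrix construction of Sections 2.3 and 2.4 can be sketched as follows, assuming L1-normalized histograms per visual descriptor and TF-IDF bag-of-words vectors as inputs; all function names here are illustrative, not from the paper:

```python
import numpy as np

def histogram_intersection_affinity(H):
    """H: (n_images, n_bins) L1-normalized histograms for one visual
    descriptor. Returns the pairwise histogram-intersection matrix."""
    return np.minimum(H[:, None, :], H[None, :, :]).sum(axis=2)

def cosine_affinity(X):
    """X: (n_images, n_terms) bag-of-words vectors with TF-IDF weights.
    Returns the pairwise cosine-similarity matrix."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

def fused_affinity(visual_hists, tfidf):
    """Average the per-descriptor visual affinities, then average the
    result with the textual affinity, as in the fused runs."""
    visual = np.mean([histogram_intersection_affinity(H) for H in visual_hists],
                     axis=0)
    textual = cosine_affinity(tfidf)
    return 0.5 * (visual + textual)
```

In the paper `visual_hists` would hold the ten official descriptors; the fused matrix is then column-normalized and fed to the ranking methods of Section 2.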
Table 1: Results evaluated by experts for the entire test set of 346 locations.

  Runs            P@5     P@10    P@20    P@30    P@40    P@50
  Only visual     0.7164  0.7056  0.7092  0.7076  0.6948  0.6752
  Only textual    0.7082  0.6863  0.6845  0.6904  0.6841  0.6667
  Visual-textual  0.7135  0.7155  0.7063  0.7026  0.6934  0.6723

  Runs            CR@5    CR@10   CR@20   CR@30   CR@40   CR@50
  Only visual     0.2233  0.3633  0.5448  0.6743  0.7572  0.8154
  Only textual    0.2130  0.3579  0.5515  0.6706  0.7549  0.8094
  Visual-textual  0.2258  0.3621  0.5414  0.6642  0.7427  0.8015

  Runs            F1@5    F1@10   F1@20   F1@30   F1@40   F1@50
  Only visual     0.3288  0.4617  0.5926  0.6618  0.6936  0.7068
  Only textual    0.3180  0.4531  0.5869  0.6544  0.6890  0.7001
  Visual-textual  0.3303  0.4614  0.5879  0.6545  0.6869  0.6995

Table 2: Results evaluated by the crowd (the average of GT1, GT2 and GT3) for a subset of 50 locations from the test set.

  Runs            P@5     P@10    P@20    P@30    P@40    P@50
  Only visual     0.7061  0.6959  0.6857  0.6878  0.6847  0.6845
  Only textual    0.6857  0.6673  0.6847  0.6966  0.6964  0.6865
  Visual-textual  0.6367  0.6531  0.6653  0.6823  0.6765  0.6747

  Runs            CR@5    CR@10   CR@20   CR@30   CR@40   CR@50
  Only visual     0.5947  0.7198  0.8070  0.8803  0.9153  0.9394
  Only textual    0.5875  0.7331  0.8429  0.9118  0.9355  0.9449
  Visual-textual  0.5573  0.6824  0.8050  0.8829  0.9223  0.9447

  Runs            F1@5    F1@10   F1@20   F1@30   F1@40   F1@50
  Only visual     0.5915  0.6657  0.7052  0.7435  0.7586  0.7675
  Only textual    0.5818  0.6659  0.7366  0.7669  0.7770  0.7735
  Visual-textual  0.5441  0.6261  0.6971  0.7446  0.7578  0.7638

Figure 1: An example of the ranking by the three kinds of features: "The Gate of the Forbidden City in China". (Figure omitted.)

3. EXPERIMENTAL RESULTS

Tables 1 and 2 show the results of our three submitted runs as evaluated by experts and by the crowd, respectively. Note that the expert results are based on an evaluation of the entire test set of 346 locations, while the crowd results are based on an evaluation of only 50 locations in the dataset and are obtained by averaging the evaluations of three crowd workers.

Overall, the visual-only results were better than the textual-only and the visual-textual results, although the differences were not large.

Figure 1 shows the top six photos of a successful example obtained by the proposed method with the three kinds of features: textual, visual and visual-textual. These photos represent "The Gate of the Forbidden City in Beijing, China." In this example, the photos selected with the visual-textual features are more representative and diverse than those selected with only textual or only visual features. This indicates that the proposed method works successfully.

In the above example, most of the photos in the given photo set are relevant and only a few noise photos are included. However, the given photo sets of some landmarks include many noise photos. In such cases, the proposed method sometimes failed to select relevant photos and placed noise photos in the upper ranks. Therefore, the removal of noise photos is one of our important future works.

4. CONCLUSIONS

We tackled the MediaEval 2013 Retrieving Diverse Social Images Task with VisualRank [5] and Ranking with Sink Points [2]. Unfortunately, due to time limitations, we had to give up using some useful additional data, including the training dataset, GPS coordinates and Wikipedia photos. If we had had enough time, we should have used the training data to estimate optimal parameters such as α in the VisualRank formulation and the mixing weight between visual and textual similarity.
5. REFERENCES

[1] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh International World Wide Web Conference, 1998.
[2] X.-Q. Cheng, P. Du, J. Guo, X. Zhu, and Y. Chen. Ranking on data manifold with sink points. IEEE Transactions on Knowledge and Data Engineering, 25(1):177–191, 2013.
[3] T. Haveliwala. Topic-sensitive PageRank: A context-sensitive ranking algorithm for web search. IEEE Transactions on Knowledge and Data Engineering, 15(4):784–796, 2003.
[4] B. Ionescu, M. Menendez, H. Muller, and A. Popescu. Retrieving diverse social images at MediaEval 2013: Objectives, dataset and evaluation. In MediaEval 2013 Workshop, CEUR-WS.org, ISSN 1613-0073, Barcelona, Spain, October 18-19, 2013.
[5] Y. Jing and S. Baluja. VisualRank: Applying PageRank to large-scale image search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1870–1890, 2008.
[6] H. Kawakubo and K. Yanai. GeoVisualRank: A ranking method of geotagged images considering visual similarity and geo-location proximity. In Proc. of the ACM International World Wide Web Conference, 2011.