Geotagging Flickr Photos and Videos Using Language Models

Sanket Kumar Singh, University of Alberta, Edmonton, AB, Canada
Davood Rafiei, University of Alberta, Edmonton, AB, Canada

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT
This paper presents an experimental framework for the Placing tasks, both estimation and verification, at the MediaEval Benchmarking 2016. The framework provides results for four runs: the first uses metadata (such as user tags and titles of images and videos); the second uses visual features extracted from the images (such as Tamura features); the third uses textual and visual features together; and the fourth uses metadata as in the first run, but with the training data augmented with external sources. Our work mainly focuses on textual features, for which we develop a language-based model using a bag of tags with neighbour-based smoothing. The effectiveness of the framework is evaluated through experiments in the placing task.

1. INTRODUCTION
The goal of this work is to estimate the coordinates of an image or a video on the world map and to verify whether an image belongs to a given location. Tags assigned to a photo may not be location-specific, and even location-specific tags can be vague and may refer to multiple locations. Some photos have no tags, or have only tags that have not been seen before (e.g. in the training phase). All these issues make location prediction from user tags challenging. We address these problems by learning the associations between user tags and locations and by using this information in our prediction.

2. RELATED WORK
Language modeling has been used for placing photos on a map. In particular, Serdyukov et al. [7] place a grid of fixed degree over the world map and map training instances to cells based on their coordinates. They learn a model which allows them to predict the location of the test instances on the grid. Though this work provides several smoothing techniques to predict the location of a test instance whose tags are not seen, it does not differentiate between general and location-specific tags. Kordopatis-Zilos et al. [4] use a similar model but capture information regarding how many users use a particular tag in a particular region. Additionally, they use Shannon's entropy to give small weights to tags which are user-specific or general. Our base model is the same, as it provides a weighting of each tag based on its popularity among users in describing a place, but we experiment with additional components to boost the performance. Another related weighting scheme is that of Makazhanov et al. [6], which uses the Kullback-Leibler divergence to differentiate between class-specific and general terms. Even though that work is done in a different context, we experiment with this model in identifying location-specific tags.

3. PROPOSED APPROACH
The proposed framework consists of two phases: (1) preprocessing the placing dataset [1], and (2) building the model and making the predictions.

Preprocessing. Each photo or video has a title, some user tags and the id of the user who posted it. After removing punctuation and special characters from the title, the remaining terms are included in the tag set. This helps in cases where a photo has a title but no tags. In Run 4, which is also based on textual metadata, we include in our training set instances extracted from the YFCC100M [9] dataset which are uploaded by users other than those in our test set. Furthermore, we augment the tag set with place names from Geonames [10] and assign the location tags to cells based on location coordinates. In all runs, each tag that is used by only one user is removed to reduce noise, and the remaining tags are then used for training. For testing, we use only user tags in each run except for Run 4, where we additionally use the title and description for those test instances which have no user tags or none of whose tags are found in the training data. Our goal in Run 4 is to use as much data as possible. To build a model for Run 2 (which uses visual features), we use 2,182,400 images with Tamura [8] features; the features are preprocessed so they can be fed into Vowpal Wabbit (VW) [5], which is used to train the model. The dataset has 2,735 counties and these are used as labels for training; for our training, county was the smallest region with enough data points per label (812 on average, compared to 38 for town).

Methodology. For the estimation task, we place grids of 1, 0.1 and 0.01 degrees and predict a cell c for each test photo based on a generative model which estimates the probability p(t_i | c) that the tags t_i in the photo are emitted from cell c. The model captures the degree to which a tag is popular among users in describing a location within a cell, i.e.

\[
p(t_i \mid c) = \frac{\text{number of users who use tag } t_i \text{ in cell } c}{\text{number of users who use tag } t_i \text{ globally}},
\qquad
p(T \mid c) = \sum_{i=1}^{n} p(t_i \mid c),
\]

where n is the number of tags in a test instance T. The cell c that gives the maximum p(T | c) is taken as the predicted cell of the test instance T. We further extend the base model by performing neighbour-based smoothing as in [7], taking into account the users who use tag t in the neighbouring cells of cell c. Since we need to estimate the actual coordinates of a test instance within a cell, we use the coordinates of the training instance in the same cell that has the maximum Jaccard similarity to the test instance.

Test instances that have no tags (or whose tags are not seen in the training set) are assigned to the cell with the largest number of training instances. In this case, the coordinates of the training instance which has the minimum Karney's distance [3] to the other instances in the cell are taken as the estimated coordinates. To use visual features, we train a one-against-all multiclass model using VW to predict a county for the test instance. The coordinates are estimated using the same strategy as before, based on the coordinates of a training instance. Since textual features provide a more accurate estimation, visual features are used in Run 3 only if a photo has no textual features. Otherwise, only textual features are used. For the verification task, we use the place information of the training instance that was used to predict the coordinates in the estimation task, and mark a test instance as verified if its predicted location string contains the given place name.

4. RESULTS AND ANALYSIS
We performed our experiments for the estimation task using grids of 0.01, 0.1 and 1.0 degrees and evaluated the results using precision at each distance, average distance error (ADE), median distance error (MDE) and verification accuracy (VA). The results are listed in Tables 1 and 2.

Table 1: Precision at different distances (in %)

Run  Media  10 m  100 m  1 km   10 km  100 km  1000 km
1    photo  0.27  2.88   14.13  35.28  50.28   64.17
1    video  0.27  3.03   13.50  33.24  47.60   60.08
2    photo  0.0   2.0    0.42   2.13   4.0     22.97
2    video  0.0   0.0    0.14   0.81   1.77    6.95
3    photo  0.27  2.89   14.13  35.26  50.25   64.03
3    video  0.27  3.03   13.50  33.24  47.60   60.08
4    photo  0.27  2.94   13.24  33.02  51.14   64.58
4    video  0.27  3.36   13.29  32.61  49.35   61.18

Table 2: Estimation and verification results for each run

Run  Media  ADE (km)  MDE (km)  VA
1    photo  2452.24   93.49     0.64
1    video  2744.00   185.10    0.62
2    photo  5243.38   5738.70   0.50
2    video  5719.46   6374.11   0.50
3    photo  2451.53   94.23     0.64
3    video  2744.00   185.09    0.63
4    photo  2457.29   82.23     0.64
4    video  2703.20   114.48    0.63

From Table 1, we can see that the precision for large distances is high, as each target cell covers more area and has more tags. Additionally, as we apply our neighbour-based smoothing using adjacent cells, more tags from neighbours are included, which is useful in cases where tags cover a wider area, such as tags with a province name or a geographical division which covers more than one grid cell. This results in improved cell prediction accuracy.

Analyzing the wrong predictions using the validation set, we find that misspellings, mismatches between plural and singular forms, and differences in spelling (such as "barcellona" for "Barcelona" and "nederland" for "Netherlands") are some of the reasons why tags are not found in the correct cell. Famous spots such as the "Empire State" building in New York are easily predicted because of abundant location-specific tags. However, instances with general words such as "bogus" and "finding" lead to the prediction of wrong cells. In an experiment comparing top-k and top-1 predictions for test instances, we found that top-10 accuracy was 47.74% while top-1 accuracy was 31.80% (for photos and videos together) using the 0.1 degree grid (~10 km). Furthermore, the predicted cells were close to the real cells. Another set of instances that were difficult to predict were the 335,845 test instances (including photos and videos) which either had no tags or whose tags were not used by any user in the training set. We assign these instances to the most popular cell, which gives a correct prediction for only 3,751 instances.

For Run 2, we use Tamura features to train a multiclass model using VW. As the dataset consists of different landscapes, animals, places etc., it is difficult to distinguish between the different places where a photo or video was taken, and thus the model mostly predicts the most popular county. For Run 4, we augment the cells with place names from Geonames (assigning them an arbitrary user id) and with instances from the YFCC100M dataset. Since tags which are used by only one user are removed, only those Geonames tags which are also used by an actual user in the cell are retained. This increases the count of place-specific tags which are used by real users. Using the title and description for test instances which have no user tags, or whose tags are not found in the training set, reduces the median distance error for the estimation task.

Before arriving at the proposed approach, we tried to find location-specific tags by assessing their frequency concentration in a region, as compared to the whole map. This approach, however, did not work for instances where the same tag was equally present at two or more places that were far from each other. Further, we used the KL-divergence to separate the probability distribution of general tags from that of location-specific tags, but this approach also did not work well, as the model ended up giving more weight to user-specific tags such as "lehmans", "gladston", etc.

5. CONCLUSION AND FUTURE WORK
In this paper, we study the problem of predicting coordinates for multimedia objects. We adopt an approach which identifies the tags which are frequently used by users at each location. This, in turn, helps us predict the cell and thereafter the coordinates for each object. Our analysis of wrong predictions reveals that the true cells are often present in the top-k predictions and are close to the predicted cell. This seems to be an area for improvement, where one needs to disambiguate between neighbouring cells, perhaps by considering cells of varying sizes or by forming clusters based on the closeness of training instances.

6. REFERENCES
[1] J. Choi, C. Hauff, O. V. Laere, and B. Thomee. The placing task at MediaEval 2016. MediaEval 2016 Workshop, Oct. 20-21 2016.
[2] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise.
[3] C. F. Karney. Algorithms for geodesics. Journal of Geodesy, 87(1):43–55, 2013.
[4] G. Kordopatis-Zilos, S. Papadopoulos, and Y. Kompatsiaris. Geotagging social media content with a refined language modelling approach. In Pacific-Asia Workshop on Intelligence and Security Informatics, pages 21–40. Springer, 2015.
[5] J. Langford, L. Li, and A. Strehl. Vowpal Wabbit. URL https://github.com/JohnLangford/vowpal_wabbit/wiki, 2011.
[6] A. Makazhanov, D. Rafiei, and M. Waqar. Predicting political preference of Twitter users. Social Network Analysis and Mining, 4(1):1–15, 2014.
[7] P. Serdyukov, V. Murdock, and R. Van Zwol. Placing Flickr photos on a map. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 484–491. ACM, 2009.
[8] H. Tamura, S. Mori, and T. Yamawaki. Textural features corresponding to visual perception. IEEE Transactions on Systems, Man, and Cybernetics, 8(6):460–473, 1978.
[9] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
[10] M. Wick and C. Boutreux. Geonames. GeoNames Geographical Database, 2011.
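
As an illustration of the tag-based cell prediction described in Section 3, the following Python sketch shows one way the user-count model could be put together: tags used by only one user are dropped, p(t|c) is the fraction of a tag's users who used it in cell c, a test instance is assigned to the cell with the largest sum of p(t|c), and the coordinates are copied from the training instance in that cell with the highest Jaccard similarity. This is a minimal sketch under stated assumptions, not the implementation used in the runs: the function names (`cell_of`, `train`, `predict_cell`, `estimate`), the floor-based cell indexing and the input tuple layout are all hypothetical, and neighbour-based smoothing and the Karney-distance fallback for untagged instances are omitted.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size=0.1):
    """Map a coordinate to a grid cell index (assumed floor-based indexing)."""
    return (math.floor(lat / size), math.floor(lon / size))


def train(instances, size=0.1):
    """Build p(t|c) from (user_id, tags, lat, lon) training tuples."""
    instances = list(instances)
    users_by_tag = defaultdict(set)       # tag -> users who used it anywhere
    users_by_tag_cell = defaultdict(set)  # (tag, cell) -> users who used it there
    for user, tags, lat, lon in instances:
        cell = cell_of(lat, lon, size)
        for tag in set(tags):
            users_by_tag[tag].add(user)
            users_by_tag_cell[(tag, cell)].add(user)

    # Noise filter: keep only tags used by more than one user.
    kept = {tag for tag, users in users_by_tag.items() if len(users) > 1}

    # p(t|c) = users of t in cell c / users of t globally.
    prob = defaultdict(dict)
    for (tag, cell), users in users_by_tag_cell.items():
        if tag in kept:
            prob[tag][cell] = len(users) / len(users_by_tag[tag])

    # Training instances per cell, restricted to the retained tags.
    members = defaultdict(list)
    for _, tags, lat, lon in instances:
        members[cell_of(lat, lon, size)].append((set(tags) & kept, lat, lon))
    return prob, members


def predict_cell(tags, prob, members):
    """Return the cell maximising the sum of p(t|c) over the test tags."""
    scores = defaultdict(float)
    for tag in set(tags):
        for cell, p in prob.get(tag, {}).items():
            scores[cell] += p
    if not scores:
        # No known tag: fall back to the cell with the most training instances.
        return max(members, key=lambda c: len(members[c]))
    return max(scores, key=scores.get)


def jaccard(a, b):
    """Jaccard similarity of two tag sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate(tags, prob, members):
    """Predict a cell, then copy the coordinates of the most similar
    training instance in that cell (ties resolve to the first instance)."""
    cell = predict_cell(tags, prob, members)
    query = set(tags)
    _, lat, lon = max(members[cell], key=lambda m: jaccard(query, m[0]))
    return cell, (lat, lon)
```

Storing p(t|c) as a per-tag dictionary of cells keeps prediction proportional to the number of cells in which the test tags actually occur, rather than to the size of the whole grid.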