=Paper=
{{Paper
|id=Vol-1263/paper64
|storemode=property
|title=SAIVT-ADMRG @ MediaEval 2014 Social Event Detection
|pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_64.pdf
|volume=Vol-1263
|dblpUrl=https://dblp.org/rec/conf/mediaeval/DenmanDFS14
}}
==SAIVT-ADMRG @ MediaEval 2014 Social Event Detection==
Simon Denman, David Dean, Clinton Fookes, Sridha Sridharan
SAIVT Laboratory, Queensland University of Technology, Brisbane, Australia
s.denman@qut.edu.au, d.dean@qut.edu.au, c.fookes@qut.edu.au, s.sridharan@qut.edu.au

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This paper outlines the approach taken by the Speech, Audio, Image and Video Technologies laboratory and the Applied Data Mining Research Group (SAIVT-ADMRG) in the 2014 MediaEval Social Event Detection (SED) task. We participated in the event-based clustering subtask (subtask 1), and focused on investigating the incorporation of image features as another source of data to aid clustering. In particular, we developed a descriptor based around super-pixel segmentation, which allows a low-dimensional feature that incorporates both colour and texture information to be extracted and used within the popular bag-of-visual-words (BoVW) approach.

1. INTRODUCTION
The Social Event Detection (SED) task at MediaEval 2014 [4] is concerned with the detection and retrieval of events from large multimedia collections. A key component of social media is image and video data, which typically captures the events taking place. However, in previous editions limited attention has been given to this data source. For instance, in the 2013 evaluation only two of the approaches sought to incorporate image features, and in both cases they simply applied well-established techniques. Motivated by this, we seek to investigate the use of visual features to aid social event detection and clustering.

A limitation of existing widely used descriptors such as SIFT [2] is their high dimensionality (128 dimensions for the standard SIFT descriptor), which leads to increased memory demands and the need for large codebooks when used in a BoVW framework. Furthermore, descriptors such as SIFT use greyscale images, discarding colour information; and although SIFT descriptors can be computed across multiple channels to incorporate colour, this further increases dimensionality. Motivated by this, we propose a new low-dimensional descriptor that incorporates both colour and texture information through the use of super-pixel segmentation. We combine this approach with an existing text processing system [5] and evaluate it on subtask 1 (event-based clustering of the media collection). The remainder of this paper is structured as follows: Section 2 outlines the proposed approach; Section 3 presents and discusses our results; and Section 4 concludes the paper.

2. PROPOSED APPROACH
We aim to explore the use of image features for social event detection. We use the text-processing-based approach of [5] to combine meta-data (text data, time-stamp, and location information) with visual features. We employ the BoVW approach for generating a visual descriptor. Our baseline approach uses the SIFT descriptor extracted in a dense manner (with a bin size of 4 and a step size of 8), with K-means used to generate a codebook. A limitation with SIFT is its high dimensionality, necessitating a large dictionary and high memory requirements, and the fact that it ignores colour information.

To alleviate this, we propose a new feature based on super-pixel segmentation. Super-pixel segmentation aims to segment an image into a set of related pixels, such that each super-pixel is formed by a set of connected and similar pixels (see Figure 1). We use the SLIC approach of [1] to extract super-pixels, and set the target super-pixel size to 20, to ensure that features are extracted from an image patch of similar size to that used by dense SIFT.

[Figure 1: An example of super-pixel segmentation using the SLIC algorithm. Note that larger super-pixels are shown here for visualisation purposes.]

From each resultant super-pixel, we extract a set of features to describe its colour and texture. The colour component is the average colour of the super-pixel in LAB colour space divided by a normalisation factor, C. The role of C is to ensure that the colour and texture information contribute approximately equally to the feature vector; it is set empirically using the development set. The texture component is a HOG descriptor computed from all pixels in the super-pixel. We use an 8-bin histogram, and do not perform any normalisation prior to computing the HOG.

The resultant feature vector for each super-pixel can then be given as:

F = \{F_L, F_A, F_B, F_{HOG,0}, F_{HOG,1}, F_{HOG,2}, F_{HOG,3}, F_{HOG,4}, F_{HOG,5}, F_{HOG,6}, F_{HOG,7}\},    (1)

where F_L, F_A and F_B are the LAB colour features, and F_{HOG,0}, ..., F_{HOG,7} are the 8 bins of the HOG histogram.
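As an illustration, the following is a minimal sketch of the descriptor in Equation 1. The paper's implementation uses C++ and VLFeat [6]; this Python version built on scikit-image is an approximation, and several details are assumptions rather than the authors' choices: the mapping from the target super-pixel size to SLIC's segment count, the magnitude weighting of the orientation bins, and the placeholder value of the normalisation factor C (which the paper sets empirically on the development set).

```python
# Illustrative sketch of the super-pixel descriptor in Equation 1.
# Assumptions (not from the paper): SLIC region size is approximated via a
# segment count, orientation bins are gradient-magnitude weighted, and the
# normalisation factor C is a placeholder value.
import numpy as np
from skimage.color import rgb2gray, rgb2lab
from skimage.segmentation import slic

def superpixel_descriptors(image, region_size=20, C=100.0, n_bins=8):
    """Return one 11-D feature (3 LAB values + 8 HOG bins) per super-pixel
    of an RGB image."""
    h, w = image.shape[:2]
    # Approximate a target super-pixel size of region_size x region_size
    # pixels by choosing the number of segments accordingly.
    labels = slic(image, n_segments=max(1, (h * w) // region_size**2),
                  start_label=0)
    lab = rgb2lab(image)
    # Unnormalised 8-bin orientation histogram (HOG-like), computed from
    # greyscale image gradients.
    gy, gx = np.gradient(rgb2gray(image))
    mag = np.hypot(gx, gy)
    ori = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned orientation in [0, pi)
    bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)

    feats = []
    for sp in np.unique(labels):
        mask = labels == sp
        colour = lab[mask].mean(axis=0) / C                    # F_L, F_A, F_B
        hog = np.bincount(bins[mask], weights=mag[mask],
                          minlength=n_bins)                    # F_HOG,0..7
        feats.append(np.concatenate([colour, hog]))            # Equation 1
    return np.vstack(feats)
```

Descriptors produced in this way would then be quantised against the codebook described next.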
We utilise these features within the BoVW framework to build an image descriptor. A codebook is trained (using K-means or Fisher Vectors [3]) on features extracted from several thousand images. Subsequent images are then encoded using this codebook to generate a descriptor that encapsulates the content of the image, and these descriptors are compared to one another using Euclidean distance.

Finally, text and visual features are combined in the following manner:

sim(d, p) = \beta_1 sim_{cosine}(d, p) + \beta_2 sim_{time}(d, p) + \beta_3 sim_{gps}(d, p) + \beta_4 sim_{image}(d, p),    (2)

where sim_{cosine}(d, p), sim_{time}(d, p) and sim_{gps}(d, p) are the similarities of the text, timestamps and GPS locations as computed by [5]; sim_{image}(d, p) is the similarity of the image features; and \beta_i are weight parameters used to combine the different data sources. These weight parameters are learnt from the training data to maximise clustering accuracy on the training set. Entries are then clustered using the constrained method of [5], which uses document ranking to choose a neighbourhood of best candidates from which the best match is chosen.
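To make the fusion concrete, below is a minimal sketch of Equation 2. The component similarity functions and the beta weights are placeholders: the text, time and GPS terms come from [5], the weights are learnt on the training data, and mapping the Euclidean distance between BoVW descriptors to a similarity score (as in sim_image below) is our assumption rather than a detail specified in the paper.

```python
# Minimal sketch of the similarity fusion in Equation 2. All component
# functions and weights are placeholders, not the paper's implementation.
import numpy as np

def sim_image(hist_d, hist_p):
    # The paper compares BoVW descriptors using Euclidean distance; turning
    # that distance into a similarity in (0, 1] is an assumption made here.
    return 1.0 / (1.0 + np.linalg.norm(hist_d - hist_p))

def combined_similarity(d, p, betas, sim_cosine, sim_time, sim_gps, sim_img):
    """sim(d, p) = b1*sim_cosine + b2*sim_time + b3*sim_gps + b4*sim_image,
    with the weights b_i learnt from the training data (Equation 2)."""
    b1, b2, b3, b4 = betas
    return (b1 * sim_cosine(d, p) + b2 * sim_time(d, p)
            + b3 * sim_gps(d, p) + b4 * sim_img(d, p))
```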
3. EVALUATION

3.1 Runs
Our five systems are as follows:

1. Metadata only: an implementation of [5].
2. Metadata + SIFT/K-means/1000: meta-data combined with an image representation using SIFT features and a 1000-word K-means codebook.
3. Metadata + proposed super-pixel feature (SP)/K-means/1000: meta-data combined with an image representation using the proposed feature and a 1000-word K-means codebook.
4. Metadata + SP/K-means/125: as with system 3, except the dictionary is now of size 125.
5. Metadata + SP/FV/125: as with system 4, except Fisher Vector encoding [3] is used instead of K-means.

We use C++ and VLFeat [6] to encode images.

3.2 Results
Results for subtask 1 are shown in Table 1. We note that the incorporation of image data does lead to an improvement, albeit only a small one, over the baseline, with systems 2-5 all outperforming the text-only system (1). Of note is that system 4 outperforms system 3, suggesting that the larger codebook used in system 3 resulted in overfitting and thus a poorer representation. The use of Fisher Vectors [3] instead of K-means also leads to a small improvement, as can be seen in the improvement from system 4 to system 5. It should be noted that a Fisher Vector encoding could not be produced for the SIFT features, even with a much smaller dictionary size, due to the higher dimensionality of the feature and the larger memory requirements of the training process.

Run   F1       NMI      Div. F1
1     0.7443   0.8993   0.7426
2     0.7525   0.9018   0.7508
3     0.7517   0.9017   0.7500
4     0.7523   0.9018   0.7506
5     0.7525   0.9018   0.7509

Table 1: Results for the five runs for subtask 1. Refer to Section 3.1 for run descriptions.

We observe that, with the exception of system 5, the dense SIFT approach of system 2 outperforms the systems using the proposed feature (3 and 4). However, the proposed approach has a much lower memory footprint than the SIFT descriptor (for instance, dense SIFT features extracted from the training data require 254GB of storage, while the proposed approach requires only 10GB), leading to significant improvements in computational efficiency when learning codebooks and encoding features.

4. CONCLUSIONS AND FUTURE WORK
We have described our submission to the MediaEval 2014 SED task. Our approach uses a new feature representation for images, which we utilise within the popular bag-of-visual-words framework. This has been shown to offer comparable performance to the SIFT descriptor, at much greater computational and memory efficiency. Future work will continue to investigate the proposed approach. Factors such as the normalisation of the colour and HOG features, the number of orientation bins, and the size of the super-pixels will all be investigated. Furthermore, the method used to combine the visual data with the meta-data will be further investigated and refined to better utilise the visual information.

5. ACKNOWLEDGMENTS
We would like to thank Taufik Sutanto and Richi Nayak from the ADMRG at QUT for their assistance in completing this evaluation.

6. REFERENCES
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. PAMI, 34(11):2274–2282, 2012.
[2] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, volume 2, pages 1150–1157, 1999.
[3] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156. Springer, 2010.
[4] G. Petkos, S. Papadopoulos, V. Mezaris, and Y. Kompatsiaris. Social event detection at MediaEval 2014: Challenges, datasets, and evaluation. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, 2014.
[5] T. Sutanto and R. Nayak. The ranking based constrained document clustering method and its application to social event detection. In Database Systems for Advanced Applications, pages 47–60. Springer, 2014.
[6] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.