SAIVT-ADMRG @ MediaEval 2014 Social Event Detection

Simon Denman, David Dean, Clinton Fookes, Sridha Sridharan
SAIVT Laboratory, Queensland University of Technology, Brisbane, Australia
s.denman@qut.edu.au, d.dean@qut.edu.au, c.fookes@qut.edu.au, s.sridharan@qut.edu.au

ABSTRACT
This paper outlines the approach taken by the Speech, Audio, Image and Video Technologies laboratory and the Applied Data Mining Research Group (SAIVT-ADMRG) in the 2014 MediaEval Social Event Detection (SED) task. We participated in the event based clustering subtask (subtask 1), and focused on investigating the incorporation of image features as an additional source of data to aid clustering. In particular, we developed a descriptor based on super-pixel segmentation, which allows a low dimensional feature incorporating both colour and texture information to be extracted and used within the popular bag-of-visual-words (BoVW) approach.

1. INTRODUCTION
The Social Event Detection (SED) task at MediaEval 2014 [4] is concerned with the detection and retrieval of events from large multimedia collections. A key component of this social media is image and video data, which typically contains images or videos of the events taking place. However, in previous editions limited attention has been given to this data source. For instance, in the 2013 evaluation only two of the approaches sought to incorporate image features, and in both cases they simply applied well established techniques. Motivated by this, we seek to investigate the use of visual features to aid social event detection and clustering.

A limitation of existing widely used descriptors such as SIFT [2] is their high dimensionality (128 dimensions for standard SIFT), which leads to increased memory demands and the need for large codebooks when used in a BoVW framework. Furthermore, descriptors such as SIFT use greyscale images, discarding colour information; and although SIFT descriptors can be computed across multiple channels to incorporate colour, this further increases dimensionality. Motivated by this, we propose a new low dimensional descriptor that incorporates both colour and texture information through the use of super-pixel segmentation. We combine this approach with an existing text processing system [5] and evaluate it on subtask 1 (event based clustering of the media collection). The remainder of this paper is structured as follows: Section 2 outlines the proposed approach; Section 3 presents and discusses our results; and Section 4 concludes the paper.

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

2. PROPOSED APPROACH
We aim to explore the use of image features for social event detection. We use the text processing based approach of [5] to combine meta-data (text data, time-stamps, and location information) with visual features, and employ the BoVW approach to generate a visual descriptor. Our baseline approach uses the SIFT descriptor extracted in a dense manner (with a bin size of 4 and a step size of 8), with K-means used to generate a codebook. The limitations of SIFT are its high dimensionality, which necessitates a large dictionary and high memory requirements, and the fact that it ignores colour information. To alleviate this, we propose a new feature based on super-pixel segmentation. Super-pixel segmentation aims to divide an image into sets of related pixels, such that each super-pixel is formed by a set of connected and similar pixels (see Figure 1). We use the SLIC approach of [1] to extract super-pixels, and set the target super-pixel size to 20, so that features are extracted from image patches of a similar size to those used by dense SIFT. From each resultant super-pixel we extract a set of features describing its colour and texture. The colour component is the average colour of the super-pixel in LAB colour space, divided by a normalisation factor C. The role of C is to ensure that the colour and texture information contribute approximately equally to the feature vector; it is set empirically using the development set. The texture component is a HOG descriptor computed from all pixels in the super-pixel. We use an 8-bin histogram, and do not perform any normalisation prior to computing the HOG.

The resultant feature vector for each super-pixel is then given as

    $F = \{F_L, F_A, F_B, F_{HOG,0}, F_{HOG,1}, \ldots, F_{HOG,7}\}$,    (1)

where $F_L$, $F_A$ and $F_B$ are the LAB colour features, and $F_{HOG,0} \ldots F_{HOG,7}$ are the 8 bins of the HOG histogram.
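To make the descriptor concrete, a minimal sketch of the per-super-pixel feature extraction is given below. This is an illustrative Python re-implementation, not the system's C++ code: skimage's SLIC exposes a target segment count rather than a target size (so n_segments is derived from the desired patch size), the unsigned orientation range is an assumption, and C is a placeholder for the value tuned on the development set.

    import numpy as np
    from skimage.color import rgb2lab, rgb2gray
    from skimage.segmentation import slic

    def superpixel_features(rgb, target_size=20, C=100.0, n_bins=8):
        """Illustrative sketch of the proposed 11-D super-pixel descriptor:
        mean LAB colour (scaled by C) plus an unnormalised 8-bin
        orientation histogram. C and the orientation range are assumptions."""
        h, w = rgb.shape[:2]
        # skimage's SLIC takes a segment count, so approximate a target
        # super-pixel size of `target_size` pixels per side.
        n_segments = max(1, (h * w) // (target_size * target_size))
        labels = slic(rgb, n_segments=n_segments, start_label=0)

        lab = rgb2lab(rgb)
        gray = rgb2gray(rgb)
        # Image gradients for the HOG-style orientation histogram.
        gy, gx = np.gradient(gray)
        mag = np.hypot(gx, gy)
        # Map orientations to [0, pi) and quantise into n_bins.
        ori = np.mod(np.arctan2(gy, gx), np.pi)
        bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)

        feats = []
        for sp in np.unique(labels):
            mask = labels == sp
            colour = lab[mask].mean(axis=0) / C          # F_L, F_A, F_B
            hog = np.bincount(bins[mask], weights=mag[mask], minlength=n_bins)
            feats.append(np.concatenate([colour, hog]))  # 3 + 8 = 11 dims
        return np.vstack(feats)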
Figure 1: An example of super-pixel segmentation using the SLIC algorithm. Note that larger super-pixels are shown here for visualisation purposes.

We utilise these features within the BoVW framework to build an image descriptor. A codebook is trained (using K-means or Fisher Vectors [3]) on features extracted from several thousand images. Subsequent images are then encoded using this codebook to generate a descriptor that encapsulates the content of the image, and these descriptors are compared to one another using the Euclidean distance.
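As an illustration of this stage, the K-means variant might look as follows in Python (scikit-learn stands in for the C++/VLFeat implementation used in the paper; the histogram normalisation is an assumption, as the paper does not state how the encoded descriptors are scaled):

    import numpy as np
    from sklearn.cluster import KMeans

    def train_codebook(all_features, k=1000, seed=0):
        """Learn a k-word visual codebook from descriptors pooled
        across several thousand training images."""
        return KMeans(n_clusters=k, random_state=seed).fit(all_features)

    def encode_image(codebook, image_features):
        """Encode one image as a histogram of visual-word counts
        (normalisation to unit sum is an assumption)."""
        words = codebook.predict(image_features)
        hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
        return hist / max(hist.sum(), 1.0)

    def image_distance(d1, d2):
        """Descriptors are compared with the Euclidean distance."""
        return np.linalg.norm(d1 - d2)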
Finally, text and visual features are combined in the following manner:

    $sim(d, p) = \beta_1 \, sim_{cosine}(d, p) + \beta_2 \, sim_{time}(d, p) + \beta_3 \, sim_{gps}(d, p) + \beta_4 \, sim_{image}(d, p)$,    (2)

where $sim_{cosine}(d, p)$, $sim_{time}(d, p)$ and $sim_{gps}(d, p)$ are the similarities of the text, timestamps and GPS locations as computed by [5]; $sim_{image}(d, p)$ is the similarity of the image features; and $\beta_i$ are weight parameters used to combine the different data sources. These weight parameters are learnt from the training data to maximise clustering accuracy on the training set. Entries are then clustered using the constrained method of [5], which uses document ranking to choose a neighbourhood of the best candidates, from which the best match is chosen.
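Equation (2) transcribes directly into code. The sketch below assumes the four component similarities have been precomputed; the example beta values are placeholders, since the tuned weights are learnt on the training data and not reported in the paper.

    def combined_similarity(sims, betas):
        """Weighted combination of text, time, GPS and image similarities,
        as in Equation (2). `sims` and `betas` are dicts keyed by source."""
        sources = ("cosine", "time", "gps", "image")
        return sum(betas[s] * sims[s] for s in sources)

    # Example with illustrative (not tuned) weights favouring text.
    sim = combined_similarity(
        sims={"cosine": 0.82, "time": 0.67, "gps": 0.30, "image": 0.45},
        betas={"cosine": 0.6, "time": 0.2, "gps": 0.1, "image": 0.1},
    )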
3. EVALUATION

3.1 Runs
Our five systems are as follows:

1. Metadata only: an implementation of [5].

2. Metadata + SIFT/K-means/1000: Meta-data combined with an image representation using SIFT features and a 1000 word K-means codebook.

3. Metadata + proposed super-pixel feature (SP)/K-means/1000: Meta-data combined with an image representation using the proposed feature and a 1000 word K-means codebook.

4. Metadata + SP/K-means/125: As with system 3, except the dictionary is now of size 125.

5. Metadata + SP/FV/125: As with system 4, except Fisher Vector encoding [3] is used instead of K-means.

We use C++ and VLFeat [6] to encode images.
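For reference, the dense SIFT extraction of run 2 can be approximated in Python with OpenCV, sampling keypoints on a regular grid with a step of 8. This is only an approximation of VLFeat's dense SIFT, and the mapping from VLFeat's bin size of 4 to an OpenCV keypoint size is an assumption.

    import cv2

    def dense_sift(gray, step=8, patch=16):
        """Approximate dense SIFT: 128-D descriptors on a regular grid
        with a step of 8, as in the baseline. `patch` is the keypoint
        size; its relation to VLFeat's bin size of 4 is an assumption."""
        h, w = gray.shape
        grid = [cv2.KeyPoint(float(x), float(y), patch)
                for y in range(step, h - step, step)
                for x in range(step, w - step, step)]
        sift = cv2.SIFT_create()
        _, descriptors = sift.compute(gray, grid)
        return descriptors  # shape: (num_keypoints, 128)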
3.2 Results
Results for subtask 1 are shown in Table 1.

    Run | F1     | NMI    | Div. F1
    1   | 0.7443 | 0.8993 | 0.7426
    2   | 0.7525 | 0.9018 | 0.7508
    3   | 0.7517 | 0.9017 | 0.7500
    4   | 0.7523 | 0.9018 | 0.7506
    5   | 0.7525 | 0.9018 | 0.7509

Table 1: Results for the five runs for subtask 1. Refer to Section 3.1 for run descriptions.

We note that the incorporation of image data does lead to an improvement, albeit only a small one, over the baseline, with systems 2-5 all outperforming the text only system (1). Of note is that system 4 outperforms system 3, suggesting that the larger codebook used in system 3 resulted in overfitting and thus a poorer representation. The use of Fisher Vectors [3] instead of K-means also leads to a small improvement, as can be seen in the improvement from system 4 to system 5. It should be noted that a Fisher Vector encoding could not be produced for the SIFT features, even with a much smaller dictionary size, due to the higher dimensionality of the feature and the larger memory requirements of the training process.

We observe that, with the exception of system 5, the dense SIFT approach of system 2 outperforms the systems using the proposed feature (3 and 4). However, the proposed approach has a much lower memory footprint than the SIFT descriptor (for instance, dense SIFT features extracted from the training data require 254GB of storage, while the proposed approach requires only 10GB), leading to significant improvements in computational efficiency when learning codebooks and encoding features.

4. CONCLUSIONS AND FUTURE WORK
We have described our submission to the MediaEval 2014 SED task. Our approach uses a new feature representation for images, which we utilise within the popular bag-of-words framework. This has been shown to offer comparable performance to the SIFT descriptor at much greater computational and memory efficiency. Future work will continue to investigate the proposed approach. Factors such as the normalisation of the colour and HOG features, the number of orientation bins, and the size of the super-pixels will all be investigated. Furthermore, the method used to combine the visual data with the meta-data will be further investigated and refined to better utilise the visual information.

5. ACKNOWLEDGMENTS
We would like to thank Taufik Sutanto and Richi Nayak from the ADMRG at QUT for their assistance in completing this evaluation.

6. REFERENCES
[1] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. PAMI, 34(11):2274–2282, 2012.
[2] D. G. Lowe. Object recognition from local scale-invariant features. In ICCV, volume 2, pages 1150–1157, 1999.
[3] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, pages 143–156. Springer, 2010.
[4] G. Petkos, S. Papadopoulos, V. Mezaris, and Y. Kompatsiaris. Social event detection at MediaEval 2014: Challenges, datasets, and evaluation. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, 2014.
[5] T. Sutanto and R. Nayak. The ranking based constrained document clustering method and its application to social event detection. In Database Systems for Advanced Applications, pages 47–60. Springer, 2014.
[6] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.