MediaEval benchmark: Social Event Detection using LDA and external resources

Mohamed Morchid                               Georges Linarès
Laboratoire d'Informatique d'Avignon          Laboratoire d'Informatique d'Avignon
LIA, Avignon, France                          LIA, Avignon, France
mohamed.morchid@etd.univ-avignon.fr           georges.linares@univ-avignon.fr

ABSTRACT
This article presents two methods for the automatic detection of social events, evaluated on the annotated picture set of the 2011 MediaEval benchmark [1]. The first method uses a set of web pages and a semantic space obtained by Latent Dirichlet Allocation (LDA) [2, 3] to classify pictures from Flickr. The second approach uses the query to extract a subset of pictures and then classifies this subset. These approaches are compared in the experimental framework of the MediaEval 2011 Social Event Detection task.

Categories and Subject Descriptors
H.3 [Information Search and Retrieval]: Indexing

General Terms
LDA, picture categorization

Keywords
LDA, Benchmark, MediaEval, Classification, Event Detection

1. INTRODUCTION
Searching and browsing picture collections from sharing platforms requires automatic processing of both the content and the metadata provided by users or owners. Social event detection consists in finding, in a large collection of photos, the ones that are related to a specific event. Our system performs two steps: it first extracts all the pictures related to the event category (e.g. soccer, Barcelona and Rome), and then selects the pictures related to specific events. In the following, the category extraction step is named subset extraction, and the by-event clustering is named subset clustering.
We tested a method based on a semantic representation of pictures by LDA, which is compared to a simple clustering method. These two methods are described in Sections 2 and 3 of this paper, respectively.

2. FIRST RUN: LDA-BASED CONTENT REPRESENTATION
This method relies on an intermediate representation of pictures in a semantic space obtained by LDA.

2.1 Subset extraction
The proposed extraction method relies first on building a corpus from the web, obtained by querying Google with the keywords of the challenge query. From this corpus, we create a query-dependent feature vector used to evaluate picture/event-category similarities. We then select the nearest pictures according to a fixed threshold.

→ Query representation
The event category is represented by a feature vector obtained by analysing the related web pages. Web pages are collected by sending the query¹ to Google and selecting the 100 best documents. The query-dependent feature vector v is composed of the relative frequency p(w|v) of each word w of the corpus, divided by the average position of the first occurrence of the word, position_w, in the returned documents. A stop-list-based filtering process removes meaningless words. The last feature of the vector is the number of seconds between 1 January 1970 and the date the picture was taken.

→ Distance between pictures and query
We want to select the pictures that are related to the query. This is achieved by computing the distance between the picture and the feature vector:

    Dist(pict_k, v) = Σ_{w ∈ pict_k} v(w),   with   v(w) = TF_w / position_w    (1)

We create a subset with the pictures whose distance to the feature vector exceeds a fixed threshold.

2.2 Subset clustering
This subset must then be clustered into parts that are supposed to correspond to social events belonging to the same category. This clustering step is performed in the LDA space.

→ Semantic space with LDA
For each challenge, we only have the words of the query to find the events in the picture set. Here, we locate the query in a topic space estimated by LDA. The 50-topic LDA model is estimated on the dataset obtained for the Subset Extraction step.
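As an illustration, the subset-extraction scoring of Section 2.1 (Eq. 1) can be sketched as follows. This is a minimal sketch with hypothetical toy data, simple whitespace tokenisation, and the date feature omitted; all function names are ours, not the authors':

```python
from collections import Counter

def build_query_vector(documents, stop_list):
    """v(w) = TF_w / position_w (Eq. 1): term frequency over the corpus,
    divided by the average position of the first occurrence of w."""
    tf = Counter()
    first_positions = {}  # word -> first-occurrence position in each document
    for doc in documents:
        words = doc.lower().split()
        tf.update(w for w in words if w not in stop_list)
        seen = set()
        for pos, w in enumerate(words, start=1):
            if w not in stop_list and w not in seen:
                first_positions.setdefault(w, []).append(pos)
                seen.add(w)
    return {w: tf[w] / (sum(ps) / len(ps)) for w, ps in first_positions.items()}

def picture_distance(picture_words, v):
    """Dist(pict_k, v): sum of v(w) over the words of the picture (Eq. 1)."""
    return sum(v.get(w, 0.0) for w in picture_words)

def extract_subset(pictures, v, threshold):
    """Keep the pictures whose distance to the feature vector exceeds the threshold."""
    return [p for p in pictures if picture_distance(p, v) > threshold]

# Toy data, for illustration only (not the benchmark corpus).
docs = ["barcelona soccer match at camp nou", "rome soccer derby report"]
v = build_query_vector(docs, stop_list={"at", "the"})
pictures = [["soccer", "barcelona"], ["holiday", "beach"]]
subset = extract_subset(pictures, v, threshold=0.5)
print(subset)  # → [['soccer', 'barcelona']]
```

The threshold here is arbitrary; in the paper it is a fixed value tuned on the task.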
In order to estimate the LDA model, we collect text materials from the Web by using queries related to the event category, represented by a set of keywords.

→ Vector of distance
We compute, for each picture, a vector of distances to all topics. We add to this vector another feature: the number of seconds between the date of the picture (dateTaken field) and 1 January 1970.

    Dist(pict_k, t_j) = Σ_{w ∈ pict_k} p(w|t_j) · p(t_j|C)    (2)

We use the prior probability p(t_j|C) of a topic t_j in the corpus C to weight the distance of a picture to a more or less relevant topic.

→ Clustering
The selected picture set is clustered using the Expectation-Maximisation, Gaussian-based clustering algorithm [4].

3. SECOND RUN: PROPOSED APPROACH
In this run, we use only the information of the pictures and the query for the two challenges. This information consists of all the available textual metadata. The global processing scheme is similar to the one used for the LDA-based approach: a first step selects a subset of relevant pictures, which are then clustered in a second step. Here, similarities are based only on text-level comparison, the query being represented by its keywords and the pictures by their title, description and tags (if available).
Section 3.1 presents the method used to extract a subset of pictures related to the query, and Section 3.2 describes the clustering method.

3.1 Subset extraction
This first selection step relies on an estimation of the proximity of a picture to the targeted query. We count the number of occurrences of each word of the query in the picture's text materials. A specific weighting is applied according to the field in which a word occurs: if the word appears in the title, the weight is 1.0; in the description, 0.75; and 0.25 for a tag.
For each element of the picture, we compute the F-score [5] between the words of the query and the words of the picture's features. If the F-score of a section exceeds a threshold (0.75 seems to give the best results), we apply another boosting rule by multiplying the score of the element by 100. We add to the subset the pictures whose score is over 40% of the highest score.

3.2 Subset clustering
We determine the similarity between each pair of pictures:

    Sim(pict_j, pict_k) = N_{j,k} / (N_j + N_k)

where j ≠ k, N_j is the total number of words in pict_j, and N_{j,k} is the number of words that belong both to an element of pict_j and to an element of pict_k. The system puts each picture in the cluster that contains the picture of highest similarity.

4. EXPERIMENTS
For this task, we used 73269 pictures from Flickr [1]. Each picture is associated with a title, a description, an owner nickname and tags.
We present in Table 1 the results for each challenge, where E is the number of events detected for the challenge, PA the number of pictures accepted, PR the number of pictures rejected, and % the percentage of accepted pictures.

Table 1: Results for each challenge and each run
               Challenge I                Challenge II
          E    PA     PR      %       E    PA     PR      %
Run 1    10   1223   72046   1.6      9   1373   71896   1.8
Run 2     3     13   73256   0       20     65   73204   0

In Table 2 we present our evaluation (as computed by the MediaEval benchmark organisers). The measures are Normalized Mutual Information (NMI) and F-score.

Table 2: Evaluation for each challenge and each run
            Challenge I            Challenge II
         F-Score    NMI         F-Score    NMI
Run 1     10.13     0.0263       12.44    -0.01
Run 2      Un.      Un.           3.53     0.0253

5. CONCLUSION
We proposed two methods to cluster a set of pictures from Flickr. In the first run, we use LDA and web pages to cluster pictures by topics. We also propose a second method that uses the query of the challenge to estimate picture/cluster similarities. The evaluation probably reflects a technical problem that remains to be clearly understood. Nevertheless, the results show that a high-level approach such as representation in a semantic space does not perform well, probably due to its complexity and to the various possibilities of adding noise at different levels of the processing chain (data collection, topic modeling, document representation in the topic space, etc.).

Copyright is held by the author/owner(s).
MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy
¹ C1: soccer barcelona rome
  C2: may 2009 parc del forum barcelona paradiso amsterdam

6. REFERENCES
[1] S. Papadopoulos, R. Troncy, V. Mezaris, B. Huet, and I. Kompatsiaris, "Social Event Detection at MediaEval 2011: Challenges, Dataset and Evaluation," in MediaEval 2011 Workshop, Pisa, Italy, September 1-2, 2011.
[2] A. McCallum, "MALLET: A Machine Learning for Language Toolkit," 2002.
[3] L. Rigouste, O. Cappé, and F. Yvon, "Quelques observations sur le modèle LDA," Actes des IXe JADT, pp. 819-830, 2006.
[4] T. Moon, "The expectation-maximization algorithm," IEEE Signal Processing Magazine, vol. 13, no. 6, pp. 47-60, Nov. 1996.
[5] C. Goutte and E. Gaussier, "A probabilistic interpretation of precision, recall and F-score, with implication for evaluation," Advances in Information Retrieval, pp. 345-359, 2005.
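For concreteness, the pairwise similarity and greedy cluster assignment of Section 3.2 can be sketched as follows. This is an illustrative sketch on hypothetical toy metadata; the field weighting and boosting rules of Section 3.1 are omitted, and the function names are ours:

```python
def similarity(words_j, words_k):
    """Sim(pict_j, pict_k) = N_{j,k} / (N_j + N_k): number of shared words
    over the summed word counts of the two pictures."""
    shared = len(set(words_j) & set(words_k))
    return shared / (len(words_j) + len(words_k))

def cluster(pictures):
    """Put each picture in the cluster containing its most similar picture;
    open a new cluster when no clustered picture shares any word with it."""
    clusters = []  # each cluster is a list of picture indices
    for k in range(len(pictures)):
        best_score, best_cluster = 0.0, None
        for c in clusters:
            for j in c:
                s = similarity(pictures[j], pictures[k])
                if s > best_score:
                    best_score, best_cluster = s, c
        if best_cluster is not None:
            best_cluster.append(k)
        else:
            clusters.append([k])
    return clusters

# Toy metadata words (hypothetical, not from the benchmark dataset).
picts = [["soccer", "barcelona", "camp", "nou"],
         ["soccer", "barcelona", "derby"],
         ["concert", "amsterdam", "paradiso"]]
print(cluster(picts))  # → [[0, 1], [2]]
```

A picture that shares no word with any clustered picture starts a new cluster, which matches the intent of assigning each picture to the cluster of its most similar neighbour.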