Event Clustering and Classification from Social Media:
Watershed-based and Kernel Methods

Truc-Vien T. Nguyen
University of Lugano
6900 Lugano, Switzerland
thi.truc.vien.nguyen@usi.ch

Minh-Son Dao
University of Information Technology
Viet-Nam National University HCMC
sondm@uit.edu.vn

Riccardo Mattivi, Emanuele Sansone, Francesco G.B. De Natale, Giulia Boato
mmLab - University of Trento
38123 Povo (TN), Italy
{rmattivi, sansone, denatale, boato}@disi.unitn.it

ABSTRACT
In this paper, we present our methods for the event clustering and event classification tasks defined by MediaEval 2013. For event clustering, a watershed-based method with external data sources is used. Based on two main observations, the whole metadata collection is turned into a user-time (UT) image, so that each row of the image contains all records that belong to one user, with the records sorted by time. For event classification, we use supervised machine learning and experiment with Support Vector Machines. We present a composite kernel to learn jointly from text and visual features. The methods prove robust, with F-measure up to 98% in challenge 1, and the composite kernel yields competitive performance across different event types in challenge 2.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: Information Search and Retrieval

1. INTRODUCTION
This paper describes the social event detection methods built specifically for challenges 1 and 2 of MediaEval 2013 [4]. We also report and discuss the advantages and the effectiveness of the methods based on the MediaEval 2013 evaluation results.

2. CHALLENGE 1
The method proposed for tackling challenge 1 of SED 2013 is mostly inherited from the method introduced in [1]. The idea is based on the typical process by which an event is populated on social networks: (1) the user takes pictures or records videos at the time the event happens; (2) next, the user uploads, annotates, and shares his/her media on a social network; (3) then, his/her friends start commenting on, tagging, and sharing that event over the network. From this process, two crucial clues are deduced: (1) people cannot be involved in more than one event at the same time, and (2) people tend to introduce similar annotations for all images associated with the same event. From these two ideas, a user-centric data structure, namely the UT-image, is introduced for storing the data in a form that helps to exploit and explore the observations mentioned above.

The UT-image (user-time image) is a 2D data structure in which the i-th row contains all time-ordered data related to the i-th user, i.e., the time taken of UT-image(i, j) is smaller than or equal to the time taken of UT-image(i, j+1); for more details, please refer to [1]. The user-centric split-and-merge procedure is described in Algorithm 1.

Data: data that need to be clustered
Result: set of clusters
1. Translate the original data into UT-image format.
2. For each row i of the UT-image do (splitting stage)
   2.1 repeat
       2.1.1 Split the row at column j if |time-taken-of-UT-image(i, j) - time-taken-of-UT-image(i, j+1)| > time threshold.
   until no further split is possible;
3. repeat (merging stage)
   3.1 For each cluster, create a time-taken boundary (i.e., [time-start, time-end]) and a union set of the non-null (longitude, latitude) pairs, tags, titles, and descriptions, respectively.
   3.2 For any pair of clusters, MERGE them if the following conditions hold:
       - the time-taken boundaries intersect, or differ by no more than the time threshold;
       - the distance between the two sets of non-null (longitude, latitude) pairs is SMALLER than the distance threshold;
       - the Jaccard index of the two sets of tags/titles/descriptions is LARGER than the tag threshold.
   until no further merge is possible;
4. End
Algorithm 1: User-centric split-and-merge
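As an illustration of the splitting stage of Algorithm 1, the following is a minimal Python sketch of UT-image construction and time-based splitting; the record field names and the threshold value are illustrative, not taken from the paper.

from collections import defaultdict

def build_ut_image(records):
    # Group records by user and sort each row by time taken: the UT-image.
    rows = defaultdict(list)
    for rec in records:
        rows[rec["user"]].append(rec)
    for row in rows.values():
        row.sort(key=lambda r: r["time_taken"])
    return rows

def split_rows(ut_image, time_threshold):
    # Splitting stage: cut a row wherever the gap between consecutive
    # records exceeds the time threshold; each segment is a candidate cluster.
    clusters = []
    for row in ut_image.values():
        segment = [row[0]]
        for prev, cur in zip(row, row[1:]):
            if cur["time_taken"] - prev["time_taken"] > time_threshold:
                clusters.append(segment)
                segment = []
            segment.append(cur)
        clusters.append(segment)
    return clusters

# Example with POSIX timestamps and a 6-hour threshold (illustrative):
records = [{"user": "u1", "time_taken": 0}, {"user": "u1", "time_taken": 3600},
           {"user": "u1", "time_taken": 100000}, {"user": "u2", "time_taken": 50}]
print(split_rows(build_ut_image(records), 6 * 3600))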
In order to increase the accuracy of the merging stage, "common sense" is taken into account to find the most "common pattern" in the tags field (e.g., the most common words users tend to use when tagging the same event). Here, the TF-IDF method is applied to the tags of each cluster to extract the most common keywords, which are then used as the main clue for merging clusters. The common sense merging procedure is described in Algorithm 2.
Data: set of clusters generated by Algorithm 1
Result: set of new clusters
1. For each cluster, apply TF-IDF to its tag set and select the most common keywords to create a "new common sense tags" set.
2. repeat
   2.1 For any two clusters, MERGE them if the Jaccard index of their "new common sense tags" sets is LARGER than the tag threshold.
   2.2 Apply TF-IDF to the merged "new common sense tags" set, select the most common keywords, and update the set.
until no further merge is possible;
3. End
Algorithm 2: Common sense merging

We used Algorithm 1 for run 1, Algorithm 1 with different parameters for run 2, and both Algorithms 1 and 2 for run 3.
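A minimal sketch of the keyword extraction and merging test behind Algorithm 2, treating each cluster's pooled tags as one document for TF-IDF; the top-k cutoff and the merging threshold are illustrative assumptions.

import math
from collections import Counter

def common_sense_tags(clusters_tags, k=5):
    # Score each cluster's tags by TF-IDF and keep the top-k
    # as its "new common sense tags" set.
    n = len(clusters_tags)
    df = Counter()
    for tags in clusters_tags:
        df.update(set(tags))
    keyword_sets = []
    for tags in clusters_tags:
        tf = Counter(tags)
        scores = {t: (tf[t] / len(tags)) * math.log(n / df[t]) for t in tf}
        keyword_sets.append(set(sorted(scores, key=scores.get, reverse=True)[:k]))
    return keyword_sets

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Two clusters are merge candidates if their keyword sets overlap strongly.
sets = common_sense_tags([["rock", "concert", "live"],
                          ["concert", "rock", "stage"],
                          ["museum", "art"]], k=3)
print(jaccard(sets[0], sets[1]))  # merge if larger than the tag threshold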
3. CHALLENGE 2
To tackle the event classification task of challenge 2, we use supervised machine learning. We experiment with Support Vector Machines and design a composite kernel to learn jointly from text and visual features.
3.1 Text features
The data is processed with the GATE platform¹ for tokenization, POS tagging, and basic word features. We used Support Vector Machines to train and test binary classifiers. Event classification is formulated as a multiclass classification problem, and the One-vs-Rest strategy is employed, selecting the class with the largest margin as the final answer. For experimentation, we use 5-fold cross-validation with the svm-light tool². The feature set for our learning framework is described as follows.
1. w_i is the text of the title, description, or tag in each event
2. l_i is the word w_i in lower case
3. p1_i, p2_i, p3_i, p4_i are the four prefixes of w_i
4. s1_i, s2_i, s3_i, s4_i are the four suffixes of w_i
5. f_i is the part-of-speech of w_i
6. g_i is the orthographic feature, which tests whether a word is all upper-cased, initial-letter upper-cased, or all lower-cased
7. k_i is the word-form feature, which tests whether a token is a word, a number, a symbol, or a punctuation mark
8. o_i is the ontological feature. We used the ontology and knowledge base developed in [3], which contains 355 classes, 99 properties, and more than 100,000 entities. Given the full ontology, w_i is matched to the deepest subsumed child class.

Run 1 was done without external resources, i.e., without the ontological feature, whereas all the features were used in run 2.
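To make features 2-4, 6, and 7 concrete, here is a minimal sketch of per-word feature extraction; the exact encoding is our assumption, since the paper does not specify it.

import string

def word_features(w):
    # Lower-cased form plus the four prefixes and four suffixes.
    feats = {"l": w.lower()}
    for n in range(1, 5):
        feats["p%d" % n] = w[:n]
        feats["s%d" % n] = w[-n:]
    # Orthographic feature g: capitalization pattern of the token.
    feats["g"] = ("all_upper" if w.isupper() else
                  "init_upper" if w[:1].isupper() else
                  "all_lower" if w.islower() else "mixed")
    # Word-form feature k: word, number, punctuation mark, or symbol.
    if w.isalpha():
        feats["k"] = "word"
    elif w.replace(".", "", 1).isdigit():
        feats["k"] = "number"
    elif all(c in string.punctuation for c in w):
        feats["k"] = "punctuation"
    else:
        feats["k"] = "symbol"
    return feats

print(word_features("Barcelona")["g"], word_features("2013")["k"])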
3.2 Visual features
For run 3, image feature extraction was performed in a similar manner as in [2], and the SVMs, with the same settings as in [2], were trained on the data available in the SED training set. Since the training set was unbalanced in the number of samples per class, mainly towards a higher number of samples of the 'non-event' type, we balanced the training set used to train our SVMs by reducing the number of samples from the 'non-event' class.

Run 4 used the same approach, but followed a two-step classification procedure: first, a classifier was learnt with only the 'event' and 'non-event' classes; second, another classifier was trained on the remaining eight classes corresponding to the different event types. Runs 3 and 4 did not use the time metadata associated with the images.
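A minimal sketch of run 4's two-step procedure, with scikit-learn's LinearSVC standing in for the SVM setup of [2]; the library choice, label names, and parameters are our assumptions.

import numpy as np
from sklearn.svm import LinearSVC

def train_two_step(X, y, non_event="non-event"):
    # Step 1: binary event vs. non-event classifier on the full training set
    # (optionally balanced by subsampling the non-event class, as in run 3).
    is_event = np.array([label != non_event for label in y])
    step1 = LinearSVC().fit(X, is_event)
    # Step 2: eight-way event-type classifier trained on event samples only.
    step2 = LinearSVC().fit(X[is_event], np.asarray(y)[is_event])
    return step1, step2

def predict_two_step(step1, step2, X, non_event="non-event"):
    pred = np.array([non_event] * len(X), dtype=object)
    event_mask = step1.predict(X).astype(bool)
    if event_mask.any():
        pred[event_mask] = step2.predict(X[event_mask])
    return pred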
3.3 Combine features
In run 5, we used a composite kernel to combine text and visual features: CK = α · K_T + (1 − α) · K_V, where α is a weighting coefficient and K_T and K_V are the kernels applied to the text and visual features, respectively. We experimented with α = 0.5.
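A minimal sketch of the composite kernel CK = α · K_T + (1 − α) · K_V on precomputed Gram matrices; the linear base kernels and the use of a precomputed-kernel SVM are our assumptions, as the paper does not state them.

import numpy as np
from sklearn.svm import SVC

def composite_kernel(X_text, X_visual, alpha=0.5):
    # CK = alpha * K_T + (1 - alpha) * K_V, here with linear base kernels
    # over the text and visual feature matrices (one row per sample).
    return alpha * (X_text @ X_text.T) + (1 - alpha) * (X_visual @ X_visual.T)

# Usage with synthetic data: any kernel machine accepting a precomputed
# Gram matrix can train on CK.
rng = np.random.default_rng(0)
X_text, X_vis = rng.random((20, 50)), rng.random((20, 128))
y = np.arange(20) % 2
clf = SVC(kernel="precomputed").fit(composite_kernel(X_text, X_vis), y)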
4. RESULTS AND CONCLUSIONS
The results are reported in Tables 1 and 2. In general, the proposed methods prove very competitive, although there is still room for improvement. For challenge 1, the watershed-based method works well, with results up to 98%. For challenge 2, the event vs. non-event classification is acceptable in almost every run, as is the detection of some classes. In the last run, with the composite kernel combining text and visual features, 5 of the 9 classes are above 55% F-measure.

Since we followed a supervised machine learning approach for challenge 2, the class "fashion" could not be learnt efficiently with only 36 positive instances; a rule-based approach might have worked better. Moreover, it is not trivial to provide good detection for the class "other events", which is a rather ill-defined class. The feature combination did best on the class "theater dance". Meanwhile, we also observe that the class "exhibition" has 272 positive instances but could not be learnt with any kind of features, and should be studied in more detail.

Table 1: Results of Challenge 1
Run    F1      NMI     Div F1
1      92.34   98.29   87.05
2      93.16   98.48   87.88
3      93.20   98.49   87.93

Table 2: Results of Challenge 2
Run    F1      Div F1   Event/Non-event F1   Event/Non-event Div F1
1      44.84   33.96    71.30                21.96
2      44.95   34.08    71.32                21.96
3      36.31   25.31    88.54                39.08
4      24.30   13.53    87.10                37.61
5      42.20   31.45    72.18                22.42

5. REFERENCES
[1] M.-S. Dao, G. Boato, F. G. De Natale, and T.-V. T. Nguyen. Jointly exploiting visual and non-visual information for event-related social media retrieval. In Proceedings of the 3rd ACM International Conference on Multimedia Retrieval (ICMR), pages 159-166. ACM, 2013.
[2] R. Mattivi, J. Uijlings, F. G. De Natale, and N. Sebe. Exploitation of time constraints for (sub-)event recognition. In Proceedings of the 2011 Joint ACM Workshop on Modeling and Representing Events, pages 7-12, New York, NY, USA, 2011. ACM.
[3] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, and A. Kirilov. KIM - a semantic platform for information extraction and retrieval. Natural Language Engineering, 10(3-4):375-392, Sept. 2004.
[4] T. Reuter, S. Papadopoulos, V. Mezaris, P. Cimiano, C. de Vries, and S. Geva. Social event detection at MediaEval 2013: Challenges, datasets, and evaluation. In Proceedings of MediaEval 2013, Barcelona, Spain, October 2013.

¹ http://gate.ac.uk/
² http://svmlight.joachims.org/

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain