IRISA and KUL at MediaEval 2014: Search and Hyperlinking Task

Anca-Roxana Şimon*, Guillaume Gravier**, Pascale Sébillot***
IRISA & INRIA Rennes
Univ. Rennes 1*, CNRS**, INSA***
35042 Rennes Cedex, France
firstname.lastname@irisa.fr

Marie-Francine Moens
KU Leuven
Celestijnenlaan 200A
B-3001 Heverlee, Belgium
sien.moens@cs.kuleuven.be

Copyright is held by the author/owner(s). MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain.

ABSTRACT
This paper presents our approach and results in the hyperlinking sub-task at MediaEval 2014. A two-step approach is implemented: relying on a topic segmentation technique, the first step generates potential target segments; then, for each anchor, the best 20 target segments are selected according to two distinct strategies. The first strategy focuses on the identification of very similar targets using n-grams and named entities; the second makes use of an intermediate structure built from topic models, which offers the possibility to control serendipity and to explain the links created.

1. INTRODUCTION
This paper presents the joint participation of IRISA and KUL at the MediaEval 2014 Search and Hyperlinking task, in which the focus is set on the hyperlinking sub-task [2]. The goal is to create hyperlinks between predefined anchor segments and short video segments, called targets, which should offer complementary information not found at search time. Targets have to be automatically extracted from the videos of a large collection. The hyperlinking system we propose consists of a two-step process: first, all potential target segments are extracted using a topic segmentation technique; then, the most relevant targets are selected for each anchor with the help of content analysis and similarity measures.

In our 2013 participation [5], we focused on precise target selection, and our most efficient system consisted of a direct comparison between anchor and target segments obtained through topic segmentation, using bags of n-grams to represent content, thus resulting in links between anchors and very similar targets. We continue exploring this direction. However, we believe that besides precise target selection, a very important aspect of hyperlinking is to offer serendipity and to explain why two video segments are linked. To address these points, we propose an intermediate structure, obtained from topic models, that allows an indirect comparison between anchors and targets. Its first advantage is that segments which do not share a substantial part of their vocabulary but are semantically related can still be linked. Moreover, this structure provides a basis to investigate why certain links are created. Link justification is an interesting aspect of hyperlinking which, to our knowledge, has not been addressed before.

2. SYSTEM OVERVIEW
The aim of our approach is to find target segments on the same topic as the anchor, or on related topics. The hyperlink generation relies on content-based comparisons exploiting spoken data obtained from automatic transcripts and manual subtitles [4]. Data are first lemmatized, and only nouns, non-modal verbs and adjectives are kept. Subsections 2.1 and 2.2 respectively detail the two parts of our system: the generation of potential target segments and the selection of the top 20 targets for each anchor.

2.1 Generating potential target segments
Each video in the test collection is partitioned into topically coherent segments with the generic topic segmentation algorithm TextSeg [7], which is domain independent, needs no a priori information, and is efficient on speech transcripts and on segments of varying lengths. Its main drawback is over-segmentation, which in our case is not problematic since the target segments must not last longer than 2 minutes.

2.2 Selection of hyperlink targets
Each anchor segment is compared with each topically coherent segment previously obtained, using similarity measures. The comparison can be direct or indirect (i.e., using intermediate structures). Four methods are proposed to select the hyperlink targets.

The baseline corresponds to the method with which we obtained our best results in 2013 (Linear+ngrams): a direct comparison between segments is done, contents being represented by bags of unigrams, bigrams and trigrams. The second method extends the previous one by using the Stanford Named Entity Recognizer (NER) [3], a 3-class (person, organization and location) entity tagger (Linear+ngrams+NEs). A Jaccard similarity coefficient is used to evaluate similarities in terms of shared named entities (NEs). The n-gram and NE similarity scores are combined, with weights of 0.3 and 0.7 respectively, chosen to favor precise alignments and thus not to reward serendipity.
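As an illustration, the two direct methods can be sketched as follows. This is a minimal sketch in plain Python: the function names are ours, and applying the Jaccard coefficient to the n-gram bags is our assumption (the paper specifies Jaccard only for the named entities).

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(tokens):
    """Bag of unigrams, bigrams and trigrams (as a set, for overlap scoring)."""
    bag = set()
    for n in (1, 2, 3):
        bag.update(ngrams(tokens, n))
    return bag

def jaccard(a, b):
    """Jaccard coefficient between two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def linear_ngrams_nes_score(anchor_tokens, target_tokens, anchor_nes, target_nes,
                            w_ngrams=0.3, w_nes=0.7):
    """Linear+ngrams+NEs score: n-gram overlap and shared-NE Jaccard,
    linearly combined with the paper's weights of 0.3 and 0.7."""
    s_ngrams = jaccard(bag_of_ngrams(anchor_tokens), bag_of_ngrams(target_tokens))
    s_nes = jaccard(set(anchor_nes), set(target_nes))
    return w_ngrams * s_ngrams + w_nes * s_nes
```

Dropping the NE term (or setting w_nes=0) recovers the Linear+ngrams baseline, which compares the n-gram bags alone.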
For the last two methods, an intermediate structure is built using Latent Dirichlet Allocation (LDA) probabilistic topic models [1] learned on the manual transcripts of the development set (1335 hours of video). Each transcript is represented as a mixture of K latent topics, where a latent topic is a probability distribution over the vocabulary. Contrary to the bag-of-words representation, this one clusters semantically similar co-occurring terms. To construct the structure, LDA is trained using Gibbs sampling, with standard values for the hyperparameters α=50/K and β=0.01 [6], and with a varying number of latent topics K: 50, 100, 150, 200, 300, 500, 700. This range was chosen to learn general to more specific topics. Using this structure, an indirect comparison between anchor and target segments can be performed.

Our third method (TopicM) consists of computing, for every K, topic probabilities for the anchor and target segments. For each K and for each anchor-target pair, two vectors are obtained in which component i corresponds to the probability of topic i given the words contained in the segment. A similarity measure is then computed between these vectors, leading to a score for each anchor-target pair. To select the top targets, for each anchor, a linear combination of the scores obtained for each K is performed, giving more importance to the most specific topics.
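The per-K comparison of TopicM can be sketched as below. Cosine similarity and the example weights are our assumptions, since the paper does not name the similarity measure; the per-K topic-probability vectors are assumed to be pre-computed by the LDA models.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def topicm_score(anchor_vecs, target_vecs, weights):
    """Linear combination over all K of the per-K similarities.

    anchor_vecs, target_vecs: dicts mapping K -> topic-probability vector.
    weights: dict mapping K -> weight; larger weights would be assigned to
    larger (more specific) K, as the paper favors specific topics.
    """
    return sum(weights[K] * cosine(anchor_vecs[K], target_vecs[K])
               for K in weights)
```

For instance, weights could be chosen proportional to K over the grid 50...700, so that agreement on specific topics dominates the final anchor-target score.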
For the last method (HierTopicM), the previous structure with topic models is extended to form a tree-like hierarchy between the topics. This hierarchy relies on a similarity measure between the topics obtained with Ki and Kj, where Kj < Ki: each topic learned with K=700 is connected to the most similar topic learned with K=500, and so on for the other K values. With this representation, a path can be selected for each anchor, starting at the bottom of the tree (K=700) with the topic of highest probability given the words in the anchor, and then selecting the parents of that topic until the first level of the tree (K=50) is reached. The targets are then selected by a linear combination of the probability values of the topics on an anchor path, given the words in the target. Thus, in the end, only a part of the topics in the structure is considered, allowing more precise control of serendipity and justification of links.

Using the structure of topics (hierarchical or not) to select the hyperlinks, serendipity can be controlled by giving more weight to values attained with more general or more specific topics. Moreover, links can be justified by looking at the top words of the topics that contributed most to the selection of those links. This structure can also be used a posteriori to understand the links created by other methods.
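The hierarchy construction and anchor-path selection can be sketched as below. This is a simplified illustration with toy K values; the topic-similarity and topic-probability functions are assumed to be given (in our system they would come from the LDA models), and all names are ours.

```python
def build_parent_links(levels, n_topics, similarity):
    """Link each topic at level K_i to its most similar topic at the next
    coarser level K_j (K_j < K_i), forming a tree-like hierarchy.

    levels: K values from most specific to most general, e.g. [700, 500, ..., 50].
    n_topics: dict mapping K -> number of topics learned with that K.
    similarity(k_child, t_child, k_parent, t_parent): topic-topic similarity.
    Returns a dict mapping (K_i, topic) -> (K_j, parent topic).
    """
    parents = {}
    for child_k, parent_k in zip(levels, levels[1:]):
        for t in range(n_topics[child_k]):
            best = max(range(n_topics[parent_k]),
                       key=lambda p: similarity(child_k, t, parent_k, p))
            parents[(child_k, t)] = (parent_k, best)
    return parents

def anchor_path(levels, n_topics, parents, topic_prob_given_anchor):
    """Select an anchor's path: start from the most probable topic at the
    most specific level and follow parent links up to the most general one."""
    k = levels[0]
    t = max(range(n_topics[k]), key=lambda i: topic_prob_given_anchor(k, i))
    path = [(k, t)]
    while (k, t) in parents:
        k, t = parents[(k, t)]
        path.append((k, t))
    return path
```

A target is then scored against the topics on this path only, which is what restricts the comparison to a small, interpretable part of the structure.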
Considering the evaluation rules, the target segments proposed for each anchor should have a duration between 10 seconds and 2 minutes; a post-processing of the final targets is therefore done to enforce these constraints. If a segment is longer than 2 minutes, it is re-segmented with a sliding window of 2 minutes, and the best-scoring window within the segment becomes the final target segment. If a segment is shorter than 10 seconds, it is combined with its best-scoring neighbor resulting from the topic segmentation, until the minimum length is reached.

3. RESULTS
Official evaluation results are reported in Table 1. Only the top 10 results were evaluated for each method, via crowdsourcing on Amazon Mechanical Turk (AMT); other results were evaluated automatically based on the evaluation of the selected results from all participants.

                              P@5     P@10    P@20
Linear+ngrams (LIMSI)         0.1     0.096   0.06
Linear+ngrams (MANUAL)        0.19    0.16    0.1
Linear+ngrams+NEs (LIMSI)     0.046   0.036   0.02
TopicM (LIMSI)                0.093   0.093   0.048
HierTopicM (LIMSI)            0.03    0.02    0.01

Table 1: Precision values obtained for all proposed methods on the 2014 test set.

The best results are obtained with the topic segmentation with n-grams on the manual subtitles (Linear+ngrams). This means that turkers prefer the content of the target segments to be closely related to that of the anchor. As anticipated, the precision decreases on the automatic transcripts. Unexpectedly, adding NEs to this method, and therefore favoring segments about the same people, places and organizations, diminished the precision. From a manual assessment of several target segments proposed using NEs, it seems that having targets speaking about the same people (e.g., Madonna, Beckham) in different circumstances (i.e., shows on different subjects: diets, charity) is not relevant. Possibly, by giving less weight to the NEs shared between anchors and targets, the precision could be improved.

Regarding the topic model-based methods, the one that uses all the topics learned (TopicM) yields better results, comparable to those obtained with the Linear+ngrams method. From a manual assessment of the results, it seems that the topics, even those learned with K=700, are too general. The targets that were considered relevant are those for which the anchor addresses a more general topic (e.g., wildlife). The problem of generality also appears for the HierTopicM method; moreover, with only a part of the topics considered, its results are worse than those of TopicM. Such intermediate structures could be improved by learning more specific topics. Still, having the top words of the topics that best explain the anchors and the targets can help interpret and justify the links. Additionally, with an intermediate structure, anchors are linked to targets even without sharing much vocabulary.

4. REFERENCES
[1] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.
[2] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2014. In Working Notes of the MediaEval 2014 Workshop, Barcelona, Spain, 2014.
[3] J. R. Finkel, T. Grenager, and C. D. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Association for Computational Linguistics, Ann Arbor, USA, 2005.
[4] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1-2):89-108, 2002.
[5] C. Guinaudeau, A.-R. Şimon, G. Gravier, and P. Sébillot. HITS and IRISA at MediaEval 2013: Search and hyperlinking task. In Working Notes Proceedings of the MediaEval Workshop, Barcelona, Spain, 2013.
[6] M. Steyvers and T. Griffiths. Probabilistic topic models. Handbook of Latent Semantic Analysis, 427(7):424-440, 2007.
[7] M. Utiyama and H. Isahara. A statistical model for domain-independent text segmentation. In Association for Computational Linguistics, Toulouse, France, 2001.