Idiap at MediaEval 2013: Search and Hyperlinking Task

Chidansh Bhatt, Idiap Research Institute, Martigny, Switzerland, cbhatt@idiap.ch
Nikolaos Pappas, Idiap and EPFL, Martigny, Switzerland, npappas@idiap.ch
Maryam Habibi, Idiap and EPFL, Martigny, Switzerland, mhabibi@idiap.ch
Andrei Popescu-Belis, Idiap Research Institute, Martigny, Switzerland, apbelis@idiap.ch


ABSTRACT
The Idiap system for the Search and Hyperlinking Task uses topic-based segmentation, content-based recommendation algorithms, and multimodal re-ranking. For both sub-tasks, our system performs better with automatic speech recognition output than with manual subtitles. For linking, the results benefit from the fusion of text and visual concepts detected in the anchors.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing; H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems

Keywords
Topic segmentation; video search; video hyperlinking.

1.   INTRODUCTION

  This paper outlines the Idiap system for the MediaEval 2013 Search and Hyperlinking Task [3]. The search sub-task required finding a specific segment of a show (from 1,260 hours of broadcast TV material provided by BBC) based on a query that had been built with this “known item” in mind. The hyperlinking sub-task required finding items from the collection that are related to “anchors” from known items. We propose a unified approach to both sub-tasks, based on techniques inspired by content-based recommender systems [6], which provide the most similar segments to a given text query or to another segment, based on words. For hyperlinking, we also use the visual concepts detected in the anchor in order to rerank answers based on visual similarity.

2.   SYSTEM OVERVIEW
  The Idiap system makes use of three main components, shown at the center of Fig. 1. We generate the data units, namely topic-based segments, from the subtitles or the ASR transcripts (either from LIMSI/Vocapia [4] or from LIUM [7]) using TextTiling in NLTK [1]. For search, we compute word-based similarity (from transcript and metadata) between queries and all segments in the collection, using a vector space model with TF-IDF weighting. Similarly, for hyperlinking, we first rank all segments based on similarity with the anchor. In addition, we use the visual concept detection provided by the organizers (key frames from Technicolor [5], concepts detected by Visor [2]) to generate a score matrix and then the list of nearest neighbors. Scores from text and visual similarity are fused to re-rank the final linking results.

[Figure 1 (diagram not reproduced). Inputs: transcripts (LIMSI / LIUM / subtitles), metadata (cast, synopsis, title), and visual concept detection data. Text branch: topic-based segmentation (TextTiling), then similarity score generation based on word vector spaces and TF-IDF, giving the top N results. Visual branch: segment-level indexing of visual concept detection, then similarity score generation based on visual features and KNN, giving the top K results. Re-ranking combines the two lists with weights W and 1-W into the final top N results.]

Figure 1: Overview of the Idiap system.

  Topic segmentation was performed over subtitles or transcripts using TextTiling as implemented in the NLTK toolkit. Topic shifts are based on the analysis of lexical co-occurrence patterns, computed from 20-word pseudo-sentences. (This value was chosen to satisfy the requirement of the hyperlinking task that segments are on average shorter than 2 minutes.) Then, similarity scores are assigned at sentence gaps using block comparison. The peak differences between the scores are marked as boundaries, which we fit to the closest speech segment break. The total number of segments for subtitles / LIMSI / LIUM is respectively 114,448 / 111,666 / 84,783, with average segment sizes of 53 / 53 / 68 seconds and standard deviations of 287 / 68 / 64 seconds. We found some mismatches between the durations in metadata files and the timing found in the subtitle or LIMSI transcript files, and we discarded such mismatching segments (there are respectively 488 and 956 such mismatches). For instance, “20080510 212500 bbcthree two pints of lager and” has a duration of 1,800 seconds according to the meta-data file, while the last subtitle segment ends at 00:55:26.2 and the last segment of the LIMSI transcript ends at 3325.36.
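
  The block-comparison step described above can be illustrated with a small, self-contained sketch. It is a simplified stand-in for TextTiling, not the NLTK implementation used in our system: it omits stop-word removal and score smoothing, and the block size k, the local-peak rule, and the cutoff of mean minus half a standard deviation are assumptions chosen for illustration. All function names are hypothetical.

```python
import math
from collections import Counter


def pseudo_sentences(tokens, w=20):
    """Split a token list into fixed-size pseudo-sentences of w tokens."""
    return [tokens[i:i + w] for i in range(0, len(tokens), w)]


def cosine(a, b):
    """Cosine similarity between two Counter term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def gap_scores(sentences, k=2):
    """Block comparison: compare the k pseudo-sentences on each side of a gap."""
    scores = []
    for gap in range(1, len(sentences)):
        left = Counter(t for s in sentences[max(0, gap - k):gap] for t in s)
        right = Counter(t for s in sentences[gap:gap + k] for t in s)
        scores.append(cosine(left, right))
    return scores


def depth_scores(scores):
    """Depth of each gap: climb to the nearest peak on both sides, sum the drops."""
    depths = []
    for i, s in enumerate(scores):
        left = s
        for j in range(i - 1, -1, -1):
            if scores[j] < left:
                break
            left = scores[j]
        right = s
        for j in range(i + 1, len(scores)):
            if scores[j] < right:
                break
            right = scores[j]
        depths.append((left - s) + (right - s))
    return depths


def boundaries(tokens, w=20, k=2):
    """Return token offsets of topic boundaries (assumed cutoff: mean - std / 2)."""
    sentences = pseudo_sentences(tokens, w)
    depths = depth_scores(gap_scores(sentences, k))
    if not depths:
        return []
    mean = sum(depths) / len(depths)
    std = math.sqrt(sum((d - mean) ** 2 for d in depths) / len(depths))
    cutoff = mean - std / 2
    result = []
    for i, d in enumerate(depths):
        # Keep only local peaks of the depth curve that exceed the cutoff.
        left_ok = i == 0 or d > depths[i - 1]
        right_ok = i == len(depths) - 1 or d >= depths[i + 1]
        if d > 0 and d > cutoff and left_ok and right_ok:
            result.append((i + 1) * w)
    return result
```

  On a toy input of 16 pet-related tokens followed by 16 finance-related tokens, boundaries(tokens, w=4, k=2) returns [16], the token offset of the topic shift. In the system itself, each such boundary would then be moved to the closest speech segment break.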
  Segment search was performed by indexing the text segments in a word vector space with TF-IDF weights, representing each textual query (and words from the “visual cues”) in the same space, and retrieving the segments most similar to the query using cosine similarity. We first tokenized the text and removed stop words. We tested several parameters on the small development set with the LIMSI transcript: the order of n-grams (1, 2, or 3) and the size of the vocabulary (10k, 20k, 30k, 40k, 50k words). The best scores (ranks of known items in the results) were reached for 50k words with unigrams, bigrams and trigrams. With these features, we found on the development set that the LIMSI transcript performed best, followed by LIUM, LIUM with metadata, and subtitles. We submitted 4 runs for the search sub-task: three were each based on the words of one source (subtitles, LIMSI, or LIUM), and the fourth used the LIUM transcript but appended to each segment the words from the metadata (cast, synopsis, series, and episode name).
  For hyperlinking segments from anchors, indexing is performed as above, though using only unigrams and a vocabulary of 20,000 words. For scenario A (anchor information only), we extended the anchor text with text from segments containing or overlapping the anchor boundaries. For scenario C, we considered the text within the start time and end time of the provided known item, along with text from segments containing or overlapping the known-item boundaries. We enriched the subtitle/ASR text using the textual metadata (title, series, episode) and webdata (cast, synopsis). The segments and anchors were indexed into a vector space with TF-IDF weights, and the top N most similar segments were found by cosine similarity.
  Then, we reranked results based on visual feature similarity, using the visual concept detection scores per keyframe (provided by the organizers). Keyframes were first aligned to topic-based segments using shot information [5], with an average of 5 keyframes per segment. The same alignment was performed for the anchors (8 frames) and for anchors with contexts (55 frames). For each segment, we generated a visual feature vector using the concepts with the highest scores from the keyframes of the segment. Using KNN, we ranked all segments by decreasing similarity to an anchor. Then, we reranked text-based results using visual information, respectively with weight W (for text) and 1 − W (for visual). We chose W = 0.8 in the case of subtitles (assuming a higher accuracy) and W = 0.6 for transcripts. Finally, we ignored segments shorter than 10 s and chunked larger segments into 2-minute segments. We submitted 3 runs: two with the subtitle words (scenarios A and C) and one with the LIMSI transcript (C).

3.   RESULTS

  The official search results (Table 1) show the same ranking as on the development set. Using the LIMSI transcript outperforms the LIUM one, which is not helped by metadata (this might be due to low-frequency features in the metadata). Surprisingly, subtitles yield the lowest scores.

     Submission        MRR      mGAP     MASP
     Subtitles         0.064    0.044    0.044
     LIUM + Meta       0.085    0.054    0.053
     LIUM              0.090    0.058    0.057
     LIMSI             0.110    0.060    0.060

Table 1: Official Idiap results for the search task.

  The overall low scores (esp. on mGAP and MASP) could be due to the short average size of our segments, which were not calibrated to match the average size of known items.
  Analyzing results per query, in 12 out of 50 test queries our best run gets the known item in the top 10 answers. These queries are not “easy”, as they vary across runs (with exceptions like item 18). Conversely, for 14 queries the known item is not found among the top 1000 results.
  The linking runs (Table 2) were scored after the deadline, separately from the other submissions, due to a time conversion problem undetected on submission. Here also, using the LIMSI transcript (first line) outperforms subtitles. This might be due to the higher weight of visual concepts when using transcripts (0.4) vs. subtitles (0.2).
  When using subtitles (2nd and 3rd rows), a higher MAP value was found when context was used, indicating that this might actually add useful information, esp. with our strategy of extending context boundaries to the closest segments. Therefore, we hypothesize that using LIMSI transcripts for the A task would lead to a lower MAP than using them for the C task.
  The precision of our system increases from top 5 to top 10 and decreases a bit at top 20. Our best system reaches close-to-average MAP on anchors 31 and 39 (respectively 0.80 and 0.50), while the MRR of the corresponding search queries (item 23 for 31, item 25 for 39) is close to zero. This is an indication that the visual features may be helpful.

     Submission           P5       P10      P20      MAP
     I V M O T6V4 C       0.620    0.583    0.413    1.00
     S V M O T8V2 C       0.400    0.443    0.370    0.832
     S V M O T8V2 A       0.400    0.433    0.340    0.782

Table 2: Idiap results for hyperlinking: precision at top 5, 10 and 20, and mean average precision.

4.   ACKNOWLEDGMENTS
  This work was supported by the Swiss National Science Foundation (AROLES project n. 51NF40-144627) and by the European Union (inEvent project FP7-ICT n. 287872). We would like to thank Maria Eskevich and Robin Aly for their valuable help with the task.

5.   REFERENCES
[1] S. Bird. NLTK: the Natural Language Toolkit. In COLING/ACL Interactive Presentations, Sydney, 2006.
[2] K. Chatfield et al. The devil is in the details: an evaluation of recent feature encoding methods. In British Machine Vision Conference, 2011.
[3] M. Eskevich, G. J. Jones, S. Chen, R. Aly, and R. Ordelman. The Search and Hyperlinking Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19 2013.
[4] J.-L. Gauvain et al. The LIMSI Broadcast News Transcription System. Speech Communication, 2002.
[5] A. Massoudi et al. A video fingerprint based on visual digest and local fingerprints. In ICIP, 2006.
[6] N. Pappas and A. Popescu-Belis. Combining content with user preferences for TED lecture recommendation. In Content Based Multimedia Indexing (CBMI), 2013.
[7] H. Schwenk et al. LIUM’s SMT machine translation systems for WMT 2011. In 6th Workshop on Statistical Machine Translation, pages 464–469, Edinburgh, 2011.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
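
  The text-based ranking and the text/visual re-ranking described in Section 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the submitted runs: the smoothed IDF formula, the use of a plain linear combination W * text + (1 − W) * visual, and the choice to pass visual similarity scores in as a ready-made list are all assumptions, and every function name is hypothetical. The paper states only that text and visual results are weighted by W and 1 − W.

```python
import math
from collections import Counter


def build_idf(segments):
    """Inverse document frequency over tokenized segments (smoothed; an assumption)."""
    n = len(segments)
    df = Counter(t for seg in segments for t in set(seg))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def tfidf(tokens, idf):
    """Sparse TF-IDF vector as a dict; out-of-vocabulary tokens are dropped."""
    return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def text_scores(query, segments):
    """Cosine similarity of a tokenized query (or anchor) to every segment."""
    idf = build_idf(segments)
    q = tfidf(query, idf)
    return [cosine(q, tfidf(seg, idf)) for seg in segments]


def rerank(text, visual, w, top_n):
    """Fuse scores as w * text + (1 - w) * visual; return the top_n segment indices."""
    fused = [w * t + (1.0 - w) * v for t, v in zip(text, visual)]
    order = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return order[:top_n]
```

  With w = 0.8 (the value we used for subtitles) the text ranking dominates, while a lower w lets a segment with strong visual similarity overtake a better text match. In practice both score lists would need to be on comparable scales before fusion, which this sketch takes for granted.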