<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Idiap at MediaEval 2013: Search and Hyperlinking Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chidansh Bhatt</string-name>
          <email>cbhatt@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Pappas Maryam Habibi</string-name>
          <email>mhabibi@idiap.ch</email>
          <email>npappas@idiap.ch</email>
          <email>npappas@idiap.ch mhabibi@idiap.ch</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrei Popescu-Belis</string-name>
          <email>apbelis@idiap.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Idiap Research Institute</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap and EPFL</institution>
          ,
          <addr-line>Martigny</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>The Idiap system for the Search and Hyperlinking Task uses topic-based segmentation, content-based recommendation algorithms, and multimodal re-ranking. For both sub-tasks, our system performs better with automatic speech recognition output than with manual subtitles. For linking, the results benefit from the fusion of text and visual concepts detected in the anchors. Keywords: topic segmentation; video search; video hyperlinking.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        This paper outlines the Idiap system for the MediaEval
2013 Search and Hyperlinking Task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The search sub-task
required finding a specific segment of a show (from 1260
hours of broadcast TV material provided by the BBC) based on
a query that had been built with this "known item" in mind.
The hyperlinking sub-task required finding items from the
collection that are related to "anchors" from known items.
We propose a unified approach to both sub-tasks, based on
techniques inspired by content-based recommender
systems [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which provide the most similar segments to a given
text query or to another segment, based on words. For
hyperlinking, we also use the visual concepts detected in the
anchor in order to rerank answers based on visual similarity.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>
        The Idiap system makes use of three main components,
shown at the center of Fig. 1. We generate the data units,
namely topic-based segments, from the subtitles or the ASR
transcripts (either from LIMSI/Vocapia[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or from LIUM[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ])
using TextTiling in NLTK [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For search, we compute
word-based similarity (from transcript and metadata) between
queries and all segments in the collection, using a vector
space model with TF-IDF weighting. Similarly, for
hyperlinking, we first rank all segments based on
similarity with the anchor. In addition, we use the visual concept
detection provided by the organizers (key frames from
Technicolor [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], concepts detected by Visor [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) to generate a score
matrix and then the list of nearest neighbors. Scores from
text and visual similarity are fused to re-rank final linking
results.
      </p>
      <p>[Figure 1. System overview: transcripts (LIMSI / LIUM / subtitles), metadata (cast, synopsis, title), and visual concept detection data feed topic-based segmentation (TextTiling) and segment-level indexing of visual concepts; similarity scores generated from word vector spaces with TF-IDF and from visual features with KNN are fused with weights W and 1-W in a re-ranking step that returns the top N (and top K) results.]</p>
      <p>Topic segmentation was performed over subtitles or
transcripts using TextTiling as implemented in the NLTK toolkit.
Topic shifts are based on the analysis of lexical co-occurrence
patterns, computed from 20-word pseudo-sentences. (This
value was chosen to satisfy the requirement of the
hyperlinking task that segments are on average shorter than 2
minutes.) Then, similarity scores are assigned at sentence
gaps using block comparison. The peak differences between
the scores are marked as boundaries, which we fit to the
closest speech segment break. The total number of segments for
subtitles / LIMSI / LIUM is respectively 114,448 / 111,666
/ 84,783, with average segment sizes of 53 / 53 / 68 seconds
and a standard deviation of 287 / 68 / 64 seconds. We found
some mismatches between the durations in the metadata files
and the timing found in the subtitle or LIMSI transcript
files, and we discarded such mismatching segments (there
are respectively 488 and 956 such mismatches). For
instance, "20080510 212500 bbcthree two pints of lager and"
has a duration of 1,800 seconds according to the metadata
file, while the last subtitle segment ends at 00:55:26.2 and
the last segment of the LIMSI transcript ends at 3325.36.</p>
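      <p>As an illustration of this segmentation step (a minimal sketch, not the exact Idiap pipeline), the TextTilingTokenizer from NLTK can be applied per programme, with the pseudo-sentence size w set to 20 words as described above; the transcript variable and the boundary-snapping step are assumptions:</p>
      <preformat>
# Minimal sketch of topic-based segmentation with NLTK TextTiling.
# `transcript_text` is assumed to hold the full text of one programme.
from nltk.tokenize.texttiling import TextTilingTokenizer

def segment_transcript(transcript_text):
    # w=20 sets the pseudo-sentence size to 20 words, chosen so that
    # segments stay on average shorter than 2 minutes.
    tokenizer = TextTilingTokenizer(w=20)
    # tokenize() returns the list of topical segments (strings); in the
    # pipeline described above, boundaries would then be snapped to the
    # closest speech segment break using the transcript timestamps.
    return tokenizer.tokenize(transcript_text)
      </preformat>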
      <p>
        Segment search was performed by indexing the text
segments in a word vector space with TF-IDF weights,
representing each textual query (and words from the "visual
cues") in the same space, and retrieving the most similar
segments to the query using cosine similarity. We first
tokenized the text and removed stop words. We tested several
parameters on the small development set with the LIMSI
transcript: the order of n-grams (1, 2, or 3) and the size of
the vocabulary (10k, 20k, 30k, 40k, 50k words). The best
scores (ranks of known items in the results) were reached
for 50k words with unigrams, bigrams and trigrams. With
these features, we found on the development set that the
LIMSI transcript performed best, followed by LIUM, LIUM
with metadata, and subtitles. We submitted 4 runs for the
search sub-task: three were based on the words of each
transcript or subtitle, and the fourth used the LIUM transcript but
appended to each segment the words from the metadata (cast,
synopsis, series, and episode name).</p>
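      <p>The retrieval step can be approximated as follows (a hedged sketch using scikit-learn rather than the original implementation; segment_texts and query are assumed inputs), with unigrams to trigrams and a 50,000-term vocabulary as selected on the development set:</p>
      <preformat>
# Illustrative TF-IDF vector space retrieval over topic segments
# (scikit-learn sketch; not the original Idiap code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_segments(segment_texts, query, top_n=10):
    # Unigrams, bigrams and trigrams with a 50,000-term vocabulary and
    # stop-word removal, as in the best search configuration above.
    vectorizer = TfidfVectorizer(ngram_range=(1, 3),
                                 max_features=50000,
                                 stop_words="english")
    segment_matrix = vectorizer.fit_transform(segment_texts)
    query_vector = vectorizer.transform([query])
    # Cosine similarity between the query and every indexed segment.
    scores = cosine_similarity(query_vector, segment_matrix).ravel()
    ranking = scores.argsort()[::-1][:top_n]
    return [(int(i), float(scores[i])) for i in ranking]
      </preformat>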
      <p>For hyperlinking segments from anchors, indexing is
performed as above, though using only unigrams and a
vocabulary of 20,000 words. For scenario A (anchor
information only), we extended the anchor text with text from
segments containing or overlapping the anchor boundaries. For
scenario C, we considered the text within the start time
and end time of the provided known-item, along with text
from segments containing or overlapping the known-item
boundaries. We enriched the subtitle/ASR text using the textual
metadata (title, series, episode) and web data (cast,
synopsis). The segments and anchors were indexed in a vector
space with TF-IDF weights, and the top N most similar
segments were found by cosine similarity.</p>
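      <p>As a rough sketch of how the textual query for one anchor could be assembled under scenario A (the data structures and field names here are assumptions, not the original code), the anchor text is extended with every segment that contains or overlaps the anchor interval, plus the textual metadata:</p>
      <preformat>
# Hedged sketch of building the textual query for one anchor (scenario A).
# `anchor` and `segments` are assumed to carry "text"/"start"/"end" fields
# in seconds; `metadata` holds title, series, episode, cast and synopsis.
def build_anchor_query(anchor, segments, metadata):
    parts = [anchor["text"]]
    for seg in segments:
        # keep segments that contain or overlap the anchor boundaries
        overlap = min(seg["end"], anchor["end"]) - max(seg["start"], anchor["start"])
        if overlap > 0:
            parts.append(seg["text"])
    # enrich with textual metadata and web data, as described above
    parts.extend(str(value) for value in metadata.values())
    return " ".join(parts)
      </preformat>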
      <p>
        Then, we reranked results based on visual feature
similarity, using the visual concept detection scores per keyframe
(provided by the organizers). Keyframes were first aligned
to topic-based segments using shot information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], with an
average of 5 keyframes per segment. Similarly, this was
performed for the anchors (8 frames) and anchors + contexts
(55 frames). For each segment, we generated a visual
feature vector using the concepts with the highest scores from
the keyframes of the segment. Using KNN, we ranked all
segments by decreasing similarity to an anchor. Then, we
reranked text-based results using visual information,
respectively with weight W (for text) and 1 − W (for visual). We
chose W = 0.8 in the case of subtitles (assuming a higher
accuracy) and W = 0.6 for transcripts. Finally, we ignored
segments shorter than 10 s and chunked larger segments
into 2-minute segments. We submitted 3 runs: two with the
subtitle words (scenarios A and C) and one with the LIMSI
transcript (C).
      </p>
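      <p>The final re-ranking can be summarized by the following weighted fusion (a sketch under the assumption that text and visual scores are already normalized to comparable ranges; the function and variable names are illustrative):</p>
      <preformat>
# Hedged sketch of the text/visual score fusion used for re-ranking:
# each candidate segment receives W * text_score + (1 - W) * visual_score,
# with W = 0.8 for subtitles and W = 0.6 for ASR transcripts.
def fuse_and_rank(text_scores, visual_scores, use_subtitles=True):
    # text_scores, visual_scores: dicts mapping segment id to a score,
    # assumed to be normalized to comparable ranges.
    w = 0.8 if use_subtitles else 0.6
    candidates = set(text_scores) | set(visual_scores)
    fused = {seg: w * text_scores.get(seg, 0.0)
                  + (1 - w) * visual_scores.get(seg, 0.0)
             for seg in candidates}
    # return segment ids sorted by decreasing fused score
    return sorted(fused, key=fused.get, reverse=True)
      </preformat>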
    </sec>
    <sec id="sec-3">
      <title>3. RESULTS</title>
      <p>The official search results (Table 1) show the same
ranking as on the development set. Using the LIMSI transcript
outperforms the LIUM one, which is not helped by metadata
(this might be due to low-frequency features in the
metadata). Surprisingly, subtitles yield the lowest scores.</p>
      <p>The overall low scores (esp. on mGAP and MASP) could
be due to the short average size of our segments, which were
not calibrated to match the average size of known items.</p>
      <p>Analyzing results per query, in 12 out of 50 test queries
our best run gets the known item in the top 10 answers.
These queries are not \easy", as they vary across runs (with
exceptions like item 18). On the contrary, for 14 queries the
known-item is not found among the top 1000 results.</p>
      <p>[Table 1. Search results for the four submitted runs: Subtitles, LIUM + Meta, LIUM, and LIMSI.]</p>
      <p>The linking runs (Table 2) were scored after the
deadline, separately from the other submissions, due to a time
conversion problem undetected on submission. Here also,
using the LIMSI transcript (first line) outperforms subtitles.
This might be due to the higher weight of visual concepts
when using transcripts (0.4) vs. subtitles (0.2).</p>
        <p>When using subtitles (2nd and 3rd rows), a higher MAP
value was found when context was used, indicating that this
might actually add useful information, especially with our
strategy of extending context boundaries to the closest segments.
Therefore, we hypothesize that using LIMSI transcripts for
the A task would lead to an even lower MAP compared to
the LIMSI transcripts for the C task.</p>
        <p>The precision of our system increases from top 5 to top
10 and decreases a bit at top 20. Our best system reaches
close-to-average MAP on anchors 31 and 39 (respectively
0.80 and 0.50), while the MRR of the corresponding search
queries (item 23 for 31, item 25 for 39) is close to zero. This
is an indication that the visual features may be helpful.</p>
      <p>Table 2. Hyperlinking results for the three submitted runs.
I V M O T6V4 C: P@5 = 0.620, P@10 = 0.583, P@20 = 0.413, MAP = 1.00.
S V M O T8V2 C: P@5 = 0.400, P@10 = 0.443, P@20 = 0.370, MAP = 0.832.
S V M O T8V2 A: P@5 = 0.400, P@10 = 0.433, P@20 = 0.340, MAP = 0.782.</p>
    </sec>
    <sec id="sec-4">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the Swiss National Science
Foundation (AROLES project n. 51NF40-144627) and by
the European Union (inEvent project FP7-ICT n. 287872).
We would like to thank Maria Eskevich and Robin Aly for
their valuable help with the task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bird</surname>
          </string-name>
          .
          <article-title>NLTK: the Natural Language Toolkit</article-title>
          . In COLING/ACL Interactive Presentations, Sydney,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatfield</surname>
          </string-name>
          et al.
          <article-title>The devil is in the details: an evaluation of recent feature encoding methods</article-title>
          .
          <source>In British Machine Vision Conference</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2013</article-title>
          . In MediaEval 2013 Workshop, Barcelona, Spain, October
          18-19,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          et al.
          <article-title>The LIMSI Broadcast News Transcription System</article-title>
          .
          <source>Speech Communication</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Massoudi</surname>
          </string-name>
          et al.
          <article-title>A video fingerprint based on visual digest and local fingerprints</article-title>
          .
          <source>In ICIP</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N.</given-names>
            <surname>Pappas</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu-Belis</surname>
          </string-name>
          .
          <article-title>Combining content with user preferences for TED lecture recommendation</article-title>
          .
          <source>In Content Based Multimedia Indexing (CBMI)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          et al.
          <article-title>LIUM's SMT machine translation systems for WMT 2011</article-title>
          .
          <source>In 6th Workshop on Statistical Machine Translation</source>
          , pages
          <fpage>464</fpage>
          -
          <lpage>469</lpage>
          , Edinburgh
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>