<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>EURECOM @ SAVA2015: Visual Features for Multimedia Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Eskevich</string-name>
          <email>maria.eskevich@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benoit Huet</string-name>
          <email>benoit.huet@eurecom.fr</email>
        </contrib>
        <aff>EURECOM, Sophia Antipolis, France</aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>This paper describes our approach to carrying out multimedia search by connecting the textual information in the user query, or the corresponding textual description of the required visual content, to the audio-visual content of the videos within the collection. The experiments were carried out on the dataset of the Search and Anchoring in Video Archives (SAVA) task at MediaEval 2015, consisting of roughly 2700 hours of BBC TV broadcast material. We combined visual concept extraction confidence scores with information about the distances between the corresponding word vectors in order to rerank a baseline text-based search. The reranked runs did not outperform the baseline; however, they showed the potential of our method for further improvement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Issuing a textual query to search within a multimedia
collection is a task familiar to Internet users nowadays. The
systems performing this search are usually based on the
corresponding transcript content of the videos or on the
available metadata. The link between the given textual
description of the query, or of the required visual content,
and the visual features that can be automatically extracted
for all the videos in the collection has not been thoroughly
investigated. In [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] the visual content was used to determine
the segmentation units, while in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] the visual
concepts were used to rerank the result list for a search
performed within the hyperlinking task, i.e., video-to-video
search. However, as the reliability of the extracted visual
concepts and the types of the concepts themselves vary with
the training data and the task framework, it is still hard to
transfer these systems' output from one collection or task to
another while preserving the same improvement.
      </p>
      <p>
        In this paper we describe our experiments that attempt
to create this link between the visual/textual content of the
query and the visual features of the collection by
incorporating information about word vector distances into the
confidence score calculation. We take into account not only
the actual query words and the words assigned to the visual
concepts, but also their lexical context, computed as close word
vectors following the word2vec approach [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. By expanding
the list of terms for comparison with this lexical context, we
attempt to deal with the potential mismatch between the terms
used in the video and those describing the visual concepts:
the speakers in the videos might not directly describe the
visual content, which might instead be implied in the wider
lexical context of the topic of their speech.
      </p>
      <p>
        We use the dataset of the Search and Anchoring task at
MediaEval 2015 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] that contains both textual and visual
descriptions of the required content; thus we can compare the
influence of word vector similarity between the case where we
establish the connection between the textual query and the
visual content within the collection, and the case where we
connect the textual description of the visual request to the
visual content within the collection.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>To measure the impact of our approach, we create a
baseline run upon which all further implementations are based.</p>
      <p>
        First, we divide all the videos in the collection into
segments of a fixed length of 120 seconds with a 30-second
overlap. We store the corresponding LIMSI transcripts
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as the document collection, and we store, as the potential
jump-in point for each segment, the start of the first word
after a pause longer than 0.5 seconds or the first switch of
speakers, as in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
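      <p>A minimal sketch of this segmentation step is given below
(our illustration, not the task release code); it assumes that a
30-second overlap means that consecutive 120-second windows
start 90 seconds apart.</p>
      <preformat><![CDATA[
# Hypothetical helper illustrating fixed-length segmentation with overlap.
# Assumption: "120 seconds with a 30-second overlap" means consecutive
# windows start 90 s apart (stride = window - overlap).
def segment_video(duration_s, window_s=120, overlap_s=30):
    """Yield (start, end) time windows, in seconds, covering the video."""
    stride = window_s - overlap_s
    start = 0
    while start < duration_s:
        yield (start, min(start + window_s, duration_s))
        start += stride

# A 5-minute video: [(0, 120), (90, 210), (180, 300), (270, 300)]
print(list(segment_video(300)))
]]></preformat>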
      <p>
        Second, we use the open-source Terrier 4.0 Information
Retrieval platform (http://www.terrier.org) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] with a standard language modeling
implementation [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], with the default lambda value of 0.15,
for indexing and retrieval. The top 1000 segments returned
for each of the 30 queries, after removal of the overlapping
results, represent the baseline run.
      </p>
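      <p>As a rough illustration of the scoring behind this step (a
sketch of the language-modelling idea of [<xref ref-type="bibr" rid="ref7">7</xref>],
not Terrier's actual implementation), a unigram model with
Jelinek-Mercer interpolation and lambda = 0.15 can be written as
follows; the document and collection statistics are hypothetical
inputs.</p>
      <preformat><![CDATA[
import math

LAMBDA = 0.15  # weight on the document model, as in the run configuration

def lm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len):
    """Log-likelihood of the query under a smoothed document language model."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len          # P(t | document)
        p_coll = coll_tf.get(t, 0) / coll_len       # P(t | collection)
        p = LAMBDA * p_doc + (1 - LAMBDA) * p_coll  # interpolated estimate
        if p > 0:
            score += math.log(p)
    return score
]]></preformat>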
      <p>Third, for these top 1000 segments we calculate a new
confidence score that combines three values, see Equation 1:
i) the confidence score of the terms that are present both in the
query, textual or visual field, (C<sup>Q</sup><sub>wi</sub>) and in
the visual concepts extracted for the segment
(C<sup>VC</sup><sub>wi</sub>); ii) the confidence score of the
terms that are present both in the query, textual or visual field,
(C<sup>Q</sup><sub>wi</sub>) and in the lexical context of the
visual concepts extracted for the segment
(C<sup>W2V4VC</sup><sub>wi</sub>); iii) the confidence score of the
terms that are present both in the lexical context of the query,
textual or visual field, (C<sup>W2V4Q</sup><sub>wi</sub>) and in
the visual concepts extracted for the segment
(C<sup>VC</sup><sub>wi</sub>). We empirically chose to assign a
higher weight (0.6) to the confidence score of the first type, as
those are the words actually used in the transcripts and the
visual concepts, and lower, equal weights (0.2) to the scores
using the lexical context, see Equation 1. We use the open-source
implementation of the word2vec algorithm
(http://word2vec.googlecode.com/svn/trunk/) with the pre-trained
vectors trained on part of the Google News dataset (about 100
billion words; https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit),
cf. [<xref ref-type="bibr" rid="ref9">9</xref>]. We take the top
100 word2vec outputs into consideration, remove the stop words
from both the query and the word2vec output, and run the Porter
Stemmer [<xref ref-type="bibr" rid="ref12">12</xref>] on all lists
for normalization.</p>
      <table-wrap id="tab-runs">
        <caption>
          <p>Run configurations evaluated in Tables 1-2: query field used and source of the visual concepts.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Query fields used</th><th>Visual concepts</th></tr>
          </thead>
          <tbody>
            <tr><td>text</td><td>none</td></tr>
            <tr><td>text</td><td>Oxford</td></tr>
            <tr><td>visual</td><td>Oxford</td></tr>
            <tr><td>text</td><td>Leuven</td></tr>
            <tr><td>visual</td><td>Leuven</td></tr>
            <tr><td>text</td><td>CERTH</td></tr>
            <tr><td>visual</td><td>CERTH</td></tr>
          </tbody>
        </table>
      </table-wrap>
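      <p>The expansion step can be sketched as follows, using the
gensim library as a stand-in for the original C word2vec tool;
the vector file name and the returned term-to-similarity map are
our assumptions.</p>
      <preformat><![CDATA[
# Sketch of the lexical-context expansion (gensim as a stand-in for the
# C word2vec tool; the vector file name is an assumption).
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def lexical_context(term, topn=100):
    """Top-100 nearest neighbours of `term`, stop words removed, stemmed."""
    if term not in vectors:
        return {}
    context = {}
    for word, sim in vectors.most_similar(term, topn=topn):
        if word.lower() not in stop_words:
            context[stemmer.stem(word)] = sim  # keep similarity as a score
    return context
]]></preformat>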
      <p>Finally, the new confidence score values are used to
rerank the initial results; these are filtered for overlapping
segments, and the jump-in points of the segments are used as
start times.</p>
      <disp-formula id="eq1">
        <tex-math><![CDATA[
\mathit{ConfScore} = 0.6 \cdot \frac{\sum_{i=1}^{N_{Q \cap VC}} C^{Q}_{w_i} \cdot C^{VC}_{w_i}}{N_{Q \cap VC}}
+ 0.2 \cdot \frac{\sum_{i=1}^{N_{Q \cap W2V4VC}} C^{Q}_{w_i} \cdot C^{W2V4VC}_{w_i}}{N_{Q \cap W2V4VC}}
+ 0.2 \cdot \frac{\sum_{i=1}^{N_{W2V4Q \cap VC}} C^{W2V4Q}_{w_i} \cdot C^{VC}_{w_i}}{N_{W2V4Q \cap VC}}
\qquad (1)
        ]]></tex-math>
      </disp-formula>
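      <p>A compact sketch of Equation 1 follows (our illustration;
each argument is a hypothetical map from a stemmed term to its
confidence or similarity score, and the reranking simply sorts
segments by the returned value).</p>
      <preformat><![CDATA[
WEIGHTS = (0.6, 0.2, 0.2)  # empirically chosen weights from Equation 1

def avg_overlap(a, b):
    """Mean product of scores over the terms shared by the two maps."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    return sum(a[t] * b[t] for t in shared) / len(shared)

def conf_score(c_q, c_vc, c_w2v4vc, c_w2v4q):
    # c_q:      terms of the query (textual or visual field)
    # c_vc:     visual concepts extracted for the segment
    # c_w2v4vc: lexical context of the visual concepts
    # c_w2v4q:  lexical context of the query
    return (WEIGHTS[0] * avg_overlap(c_q, c_vc)
            + WEIGHTS[1] * avg_overlap(c_q, c_w2v4vc)
            + WEIGHTS[2] * avg_overlap(c_w2v4q, c_vc))
]]></preformat>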
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTAL RESULTS</title>
      <p>
        Tables 1-2 show the evaluation results of the submissions.
In both tables, each line represents an approach that used the
textual or visual query field (first column) and the visual
concepts extracted by the Oxford [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Leuven [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or CERTH [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
systems. Although none of these runs outperforms the
baseline, some trends can be observed. According to all of the
metrics in Table 2, the runs that use the connection between
the visual query field and the visual concepts extracted for
the collection achieve higher scores than the runs using the
textual fields. This means that, at least partly, these visual
concepts defined for another task and extracted for this
collection can be transferred to this task. In terms of
precision, the trend is not as consistent: only the runs that
use the Oxford and CERTH visual concepts have better scores
across all the measurements when the visual query description
is used, while the results based on the Leuven visual concept
extraction vary between different measurements.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSION AND FUTURE WORK</title>
      <p>In this paper we have described a new approach that
combines the confidence scores of the visual concept extraction
with the textual description of the query, weighted by the
closeness of the terms in the word vector space.</p>
      <p>Even though, as expected, we achieve higher scores for
the runs using the closeness between the visual descriptions of
the queries and the visual concepts, we achieve comparable
results when using the textual descriptions. We therefore
envisage that further tuning of the confidence score
combination and of the reranking strategies can bring the
results up to the baseline level and beyond.</p>
    </sec>
    <sec id="sec-5">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work was supported by the European Commission's
7th Framework Programme (FP7) under FP7-ICT 287911
(LinkedTV) and by Bpifrance within the NexGen-TV Project,
under grant number F1504054U.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] E. E. Apostolidis, V. Mezaris, M. Sahuguet, B. Huet, B. Cervenkova, D. Stein, S. Eickeler, J. L. R. García, R. Troncy, and L. Pikora. Automatic fine-grained hyperlinking of videos within a closed collection using scene segmentation. In Proceedings of the ACM International Conference on Multimedia, MM '14, Orlando, FL, USA, November 03-07, 2014, pages 1033-1036, 2014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] C. Bhatt, N. Pappas, M. Habibi, and A. Popescu-Belis. Idiap at MediaEval 2013: Search and Hyperlinking Task. In MediaEval 2013 Workshop, 2013.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] K. Chatfield and A. Zisserman. Visor: Towards on-the-fly large-scale object category retrieval. In Computer Vision - ACCV 2012, pages 432-446. Springer, 2013.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Chen, M. Eskevich, G. J. F. Jones, and N. E. O'Connor. An investigation into feature effectiveness for multimedia hyperlinking. In MultiMedia Modeling - 20th Anniversary International Conference, MMM 2014, Dublin, Ireland, January 6-10, 2014, Proceedings, Part II, pages 251-262, 2014.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. SAVA at MediaEval 2015: Search and anchoring in video archives. In Working Notes Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, 2015.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] M. Eskevich and G. J. F. Jones. Time-based segmentation and use of jump-in points in DCU search runs at the search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] D. Hiemstra. Using language models for information retrieval. PhD thesis, University of Twente, The Netherlands, 2001.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. Lamel and J.-L. Gauvain. Speech processing for audio indexing. In Advances in Natural Language Processing (LNCS 5221), pages 4-15. Springer, 2008.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, December 5-8, 2013, pages 3111-3119, 2013.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2013), May 2013.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3):130-137, 1980.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] M. Sahuguet, B. Huet, B. Cervenkova, E. E. Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J. L. R. García, and L. Pikora. LinkedTV at MediaEval 2013 Search and Hyperlinking Task. In Proceedings of the MediaEval 2013 Multimedia Benchmark Workshop, Barcelona, Spain, October 18-19, 2013.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>