                               EURECOM @ SAVA2015:
                        Visual Features for Multimedia Search

                                               Maria Eskevich, Benoit Huet
                                             EURECOM, Sophia Antipolis, France
                              maria.eskevich@gmail.com; benoit.huet@eurecom.fr


ABSTRACT
This paper describes our approach to carrying out multimedia search by connecting the textual information in the user query, or the corresponding textual description of the required visual content, to the audio-visual content of the videos within the collection. The experiments were carried out on the dataset of the Search and Anchoring in Video Archives (SAVA) task at MediaEval 2015, consisting of roughly 2700 hours of BBC TV broadcast material. We combined visual concept extraction confidence scores with information about the distances between the corresponding word vectors in order to rerank the baseline text-based search. The reranked runs did not outperform the baseline, but they exposed the potential of our method for further improvement.
1.   INTRODUCTION
Issuing a textual query to search within a multimedia collection is a task that is familiar to Internet users nowadays. The systems performing this search are usually based on the corresponding transcript content of the videos or on the available metadata. The link between the given textual description of the query, or of the required visual content, and the visual features that can be automatically extracted for all the videos in the collection has not been thoroughly investigated. In [13] the visual content was used to impose the segmentation units, while in [2] and [4] the visual concepts were used to rerank the result list for search performed within the hyperlinking task, i.e., video-to-video search. However, as the reliability of the extracted visual concepts and the types of the concepts themselves vary with the training data and the task framework, it is still hard to transfer the output of these systems from one collection or task to another while keeping the same impact on improvement.

In this paper we describe our experiments that attempt to create this link between the visual/textual content of the query and the visual features of the collection by incorporating information about word vector distances into the confidence score calculation. We take into account not only the actual query words and the words assigned to the visual concepts, but also their lexical context, calculated as close word vectors following the word2vec approach [10]. By expanding the list of terms used for comparison with this lexical context, we attempt to deal with the potential mismatch between the terms used in the video and those describing the visual concepts, as the speakers in the videos might not directly describe the visual content; it may only be implied in the wider lexical context of the topic of their speech.

We use the dataset of the Search and Anchoring task at MediaEval 2015 [5], which contains both textual and visual descriptions of the required content. We can therefore compare the influence of word vector similarity both when we establish the connection between the textual query and the visual content of the collection, and when we connect the textual description of the visual request to the visual content of the collection.
2.   SYSTEM OVERVIEW
To assess the impact of our approach, we create a baseline run upon which all further implementations are based.

First, we divide all the videos in the collection into segments of a fixed length of 120 seconds with a 30-second overlap step. We store the corresponding LIMSI transcripts [8] as the document collection and, as in [6], we record for each segment the start of the first word after a pause longer than 0.5 seconds, or after a first switch of speakers, as its potential jump-in point.
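To make the segmentation step concrete, the following is a minimal sketch of fixed-length overlapping windowing with jump-in point detection; the transcript word structure and field names are illustrative assumptions, not the exact format of the LIMSI output.

from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    start: float    # start time in seconds
    end: float      # end time in seconds
    speaker: str    # speaker label from the transcript
    token: str      # the recognised word

def segment_video(words: List[Word], video_len: float,
                  seg_len: float = 120.0, step: float = 30.0):
    """Cut one video into 120 s segments every 30 s and attach a jump-in
    point: the first word after a pause > 0.5 s or after a speaker switch."""
    segments = []
    t = 0.0
    while t < video_len:
        start, end = t, min(t + seg_len, video_len)
        in_seg = [w for w in words if start <= w.start < end]
        jump_in, prev = start, None
        for w in in_seg:
            if prev is not None and ((w.start - prev.end) > 0.5
                                     or w.speaker != prev.speaker):
                jump_in = w.start
                break
            prev = w
        segments.append({"start": start, "end": end, "jump_in": jump_in,
                         "text": " ".join(w.token for w in in_seg)})
        t += step
    return segments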
Second, we use the open-source Terrier 4.0 Information Retrieval platform (http://www.terrier.org) [11] with a standard language modeling implementation [7], using the default lambda value of 0.15, for indexing and retrieval. The top 1000 segments returned for each of the 30 queries, after the removal of overlapping results, represent the baseline result.
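For reference, the language modeling approach of [7] scores a segment d for a query q by smoothing document term statistics with collection statistics; with the document-model weight lambda = 0.15 used here, the score takes roughly the following form (the exact normalization inside Terrier may differ):

score(q, d) = \sum_{t \in q} \log\bigl(\lambda \, P(t \mid d) + (1 - \lambda) \, P(t \mid C)\bigr), \qquad \lambda = 0.15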
Third, for these top 1000 segments we calculate a new confidence score that combines three values, see Equation 1: i) the confidence scores of terms that are present both in the query, textual or visual field (C_{Q,w_i}), and in the visual concepts extracted for the segment (C_{VC,w_i}); ii) the confidence scores of terms that are present both in the query, textual or visual field (C_{Q,w_i}), and in the lexical context of the visual concepts extracted for the segment (C_{W2V4VC,w_i}); iii) the confidence scores of terms that are present both in the lexical context of the query, textual or visual field (C_{W2V4Q,w_i}), and in the visual concepts extracted for the segment (C_{VC,w_i}). We empirically chose to assign a higher weight (0.6) to the score of the first type, as those are the words actually used in the transcripts and the visual concepts, and equal lower weights (0.2) to the two scores that use the lexical context, see Equation 1.

Copyright is held by the author/owner(s).
MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany
                                              Table 1: Precision at ranks 5, 10, 20.
           Query fields      Visual              P@5                     P@10                           P@20
              used          concepts    overlap   bin     tol    overlap   bin      tol       overlap     bin       tol
              text            none      0.6733 0.6400 0.6133 0.6133 0.5933 0.5467             0.4067    0.3983    0.3133
               text           Oxford     0.4533 0.4467 0.4000 0.4233 0.4167 0.3767             0.3133    0.3367    0.2667
             visual          Oxford     0.4933 0.5000 0.4733 0.4633 0.4900 0.4333             0.3367    0.3683    0.2917
              text           Leuven     0.4667 0.4333 0.4400 0.4567 0.4500 0.4300             0.3450    0.3667    0.3017
             visual          Leuven     0.4400 0.4533 0.4000 0.4500 0.4333 0.4200             0.3500    0.3667    0.2883
              text          CERTH       0.3600 0.3467 0.3400 0.3333 0.3467 0.3200             0.2450    0.2567    0.2167
             visual         CERTH       0.3733 0.3600 0.3400 0.4133 0.3900 0.3933             0.2933    0.3050    0.2600

                                            Table 2: Official metrics for all the runs
                            Query fields used Visual concepts MAP       MAP bin MAP tol           MAiSP
                            text               none            0.5511 0.3529        0.3089        0.3431
                            text               Oxford          0.3196 0.2739        0.2053        0.2978
                            visual             Oxford          0.3368 0.2958        0.2293        0.3092
                            text               Leuven          0.3227 0.2801        0.2187        0.2958
                            visual             Leuven          0.3394 0.2970        0.2222        0.3117
                            text               CERTH           0.2295 0.2027        0.1554        0.1983
                            visual             CERTH           0.2624 0.2375        0.1822        0.2380


We use the open-source implementation of the word2vec algorithm (http://word2vec.googlecode.com/svn/trunk/) with vectors pre-trained on part of the Google News dataset (about 100 billion words; https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit), cf. [9]. We take the top 100 word2vec outputs into consideration, remove stop words from both the query and the word2vec output, and run the Porter stemmer [12] on all lists for normalization.
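As an illustration of this expansion step, below is a minimal sketch using the gensim and NLTK libraries; the vector file name and the helper function are assumptions for illustration, not the exact pipeline used for the submitted runs.

from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Pre-trained Google News vectors (file name assumed; ~100 billion word training corpus).
vectors = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)
stemmer = PorterStemmer()
stop = set(stopwords.words("english"))

def expand_terms(terms, topn=100):
    """Return the stemmed original terms plus their top-`topn` word2vec
    neighbours, with stop words removed -- the 'lexical context' used for matching."""
    expanded = set()
    for term in terms:
        if term.lower() in stop:
            continue
        expanded.add(stemmer.stem(term.lower()))
        if term in vectors:                      # skip out-of-vocabulary terms
            for neighbour, _sim in vectors.most_similar(term, topn=topn):
                if neighbour.lower() not in stop:
                    expanded.add(stemmer.stem(neighbour.lower()))
    return expanded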
Finally, the new confidence score values are used to rerank the initial results, which are filtered for overlapping segments, and the jump-in points of the segments are used as start times.
ConfScore = \frac{\sum_{i=1}^{N_{Q,VC}} C_{Q,w_i} \cdot C_{VC,w_i}}{N_{Q,VC}} \cdot 0.6
          + \frac{\sum_{i=1}^{N_{Q,W2V4VC}} C_{Q,w_i} \cdot C_{W2V4VC,w_i}}{N_{Q,W2V4VC}} \cdot 0.2
          + \frac{\sum_{i=1}^{N_{W2V4Q,VC}} C_{W2V4Q,w_i} \cdot C_{VC,w_i}}{N_{W2V4Q,VC}} \cdot 0.2        (1)
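A minimal sketch of how Equation 1 can be evaluated and used for reranking is given below; the term-to-confidence dictionaries and the segment structure are illustrative assumptions, while the 0.6/0.2/0.2 weights and the averaging over shared terms follow the description above.

def avg_product(conf_a, conf_b):
    """Mean of C_a(w) * C_b(w) over the terms the two sides share (0 if none)."""
    shared = set(conf_a) & set(conf_b)
    if not shared:
        return 0.0
    return sum(conf_a[w] * conf_b[w] for w in shared) / len(shared)

def conf_score(q, q_ctx, vc, vc_ctx):
    """Equation 1: q / vc are term->confidence maps for the query field and the
    segment's visual concepts; q_ctx / vc_ctx are their word2vec lexical contexts."""
    return (0.6 * avg_product(q, vc)
            + 0.2 * avg_product(q, vc_ctx)
            + 0.2 * avg_product(q_ctx, vc))

# Rerank the top-1000 baseline segments by the new confidence score; each
# segment is assumed to carry its visual-concept confidences and their context.
def rerank(baseline_segments, q, q_ctx):
    return sorted(baseline_segments,
                  key=lambda s: conf_score(q, q_ctx, s["vc"], s["vc_ctx"]),
                  reverse=True)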
                                                                     der grant number F1504054U.

3.   EXPERIMENTAL RESULTS
Tables 1-2 show the evaluation results of the submissions. In both tables, each line represents an approach that used the textual or visual query field (first column) and visual concepts extracted by the Oxford [3], Leuven [14] or CERTH [1] systems. Although none of these runs outperforms the baseline, some trends can be tracked. According to all of the metrics in Table 2, the runs that use the connection between the visual query field and the visual concepts extracted for the collection achieve higher scores than the runs using the textual field. This means that these visual concepts, defined for another task and extracted for this collection, can at least partly be transferred to this task. In terms of precision the trend is less consistent: only the runs that use the Oxford and CERTH visual concepts score better on all measurements when the visual query description is used, while the results based on the Leuven visual concept extraction vary between measurements.

4.   CONCLUSION AND FUTURE WORK
In this paper we have described a new approach that combines the confidence scores of the visual concept extraction with the textual description of the query, weighted by the closeness of the terms in the word vector space.

Although, as expected, we achieve higher scores for the runs that use the closeness between the visual descriptions of the queries and the visual concepts, we achieve comparable results when using the textual descriptions. We therefore envisage that further tuning of the confidence score combination and of the reranking strategies can bring the results to the level of the baseline and beyond.

5.   ACKNOWLEDGMENTS
This work was supported by the European Commission’s 7th Framework Programme (FP7) under FP7-ICT 287911 (LinkedTV), and by Bpifrance within the NexGen-TV Project, under grant number F1504054U.

6.   REFERENCES
 [1] E. E. Apostolidis, V. Mezaris, M. Sahuguet, B. Huet,
      B. Cervenková, D. Stein, S. Eickeler, J. L. R. García,
      R. Troncy, and L. Pikora. Automatic fine-grained
      hyperlinking of videos within a closed collection using
      scene segmentation. In Proceedings of the ACM
      International Conference on Multimedia, MM ’14,
      Orlando, FL, USA, November 03 - 07, 2014, pages
      1033–1036, 2014.
 [2] C. Bhatt, N. Pappas, M. Habibi, and
      A. Popescu-Belis. Idiap at MediaEval 2013: Search
      and Hyperlinking Task. In MediaEval 2013 Workshop,
      2013.
 [3] K. Chatfield and A. Zisserman. Visor: Towards
      on-the-fly large-scale object category retrieval. In
      Computer Vision–ACCV 2012, pages 432–446.
      Springer, 2013.
 [4] S. Chen, M. Eskevich, G. J. F. Jones, and N. E.
      O’Connor. An investigation into feature effectiveness
      for multimedia hyperlinking. In MultiMedia Modeling -
     20th Anniversary International Conference, MMM
     2014, Dublin, Ireland, January 6-10, 2014,
     Proceedings, Part II, pages 251–262, 2014.
 [5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman,
      S. Chen, and G. J. F. Jones. SAVA at MediaEval 2015:
      Search and Anchoring in Video Archives. In Working
     Notes Proceedings of the MediaEval 2015 Workshop,
     Wurzen, Germany, 2015.
 [6] M. Eskevich and G. J. F. Jones. Time-based
     segmentation and use of jump-in points in DCU
      search runs at the Search and Hyperlinking task at
      MediaEval 2013. In Proceedings of the MediaEval 2013
     Multimedia Benchmark Workshop, Barcelona, Spain,
     October 18-19, 2013., 2013.
 [7] D. Hiemstra. Using language models for information
     retrieval. PhD thesis, University of Twente, The
     Netherlands, 2001.
 [8] L. Lamel and J.-L. Gauvain. Speech processing for
     audio indexing. In Advances in Natural Language
     Processing (LNCS 5221), pages 4–15. Springer, 2008.
 [9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and
     J. Dean. Distributed representations of words and
     phrases and their compositionality. In Advances in
     Neural Information Processing Systems 26: 27th
     Annual Conference on Neural Information Processing
     Systems 2013. Proceedings of a meeting held December
     5-8, 2013, Lake Tahoe, Nevada, United States., pages
     3111–3119, 2013.
[10] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic
     regularities in continuous space word representations.
     In Proceedings of the 2013 Conference of the North
     American Chapter of the Association for
     Computational Linguistics: Human Language
     Technologies (NAACL-HLT-2013), May 2013.
[11] I. Ounis, G. Amati, V. Plachouras, B. He,
     C. Macdonald, and C. Lioma. Terrier: A High
     Performance and Scalable Information Retrieval
     Platform. In Proceedings of ACM SIGIR’06 Workshop
     on Open Source Information Retrieval (OSIR 2006),
     2006.
[12] M. F. Porter. An Algorithm for Suffix Stripping.
     Program, 14(3):130–137, 1980.
[13] M. Sahuguet, B. Huet, B. Cervenková, E. E.
      Apostolidis, V. Mezaris, D. Stein, S. Eickeler, J. L. R.
      García, and L. Pikora. LinkedTV at MediaEval 2013
      Search and Hyperlinking Task. In Proceedings of the
     MediaEval 2013 Multimedia Benchmark Workshop,
     Barcelona, Spain, October 18-19, 2013., 2013.
[14] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed
     for cross-dataset analysis. CoRR, abs/1402.5923, 2014.