LinkedTV at MediaEval 2014 Search and Hyperlinking Task

H.A. Le^1, Q.M. Bui^1, B. Huet^1, B. Červenková^2, J. Bouchner^2, E. Apostolidis^3, F. Markatopoulou^3,
A. Pournaras^3, V. Mezaris^3, D. Stein^4, S. Eickeler^4, and M. Stadtschnitzer^4

^1 Eurecom, Sophia Antipolis, France. huet@eurecom.fr
^2 University of Economics, Prague, Czech Republic. barbora.cervenkova@vse.cz
^3 Information Technologies Institute, CERTH, Thessaloniki, Greece. bmezaris@iti.gr
^4 Fraunhofer IAIS, Sankt Augustin, Germany. daniel.stein@iais.fraunhofer.de

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain


ABSTRACT

This paper presents the LinkedTV approaches for the Search and Hyperlinking (S&H) task at MediaEval 2014. Our submissions aim at evaluating two key dimensions: the temporal granularity and the visual properties of the video segments. The temporal granularity of target video segments is defined by grouping text sentences, or consecutive automatically detected shots, considering the temporal coherence, the visual similarity and the lexical cohesion among them. Visual properties are combined with text search results using multimodal fusion for re-ranking. Two alternative methods are proposed to identify which visual concepts are relevant to each query: WordNet similarity and Google Image analysis. For Hyperlinking, relevant visual concepts are identified by analysing the video anchor.
1. INTRODUCTION

This paper describes the framework used by the LinkedTV team to tackle the problem of Search and Hyperlinking inside a video collection [3]. The applied techniques originate from the LinkedTV project (http://www.linkedtv.eu/), which aims at integrating TV and Web documents by enabling users to access additional information and media resources aggregated from diverse sources, thanks to automatic media annotation. Our media annotation process is as follows. Shot segmentation is performed using a variation of [1], while the selected keyframes (one per shot) are analysed by visual concept detection [9] and Optical Character Recognition (OCR) [11] techniques. For each video, keywords are extracted from the subtitles, based on the algorithm presented in [12]. Finally, video shots are grouped into longer segments (scenes) using two hierarchical clustering strategies. Media annotations are indexed at two levels (video level and scene level) using the Apache Solr platform (http://lucene.apache.org/solr/). At the video level, document descriptions are limited to text (title, subtitle, keywords, etc.), while the scene-level documents are characterized by both text fields (subtitle/transcript, keywords, OCR, etc.) and float fields, each float field corresponding to a unique visual concept response.
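To make the two-level indexing concrete, the short Python sketch below shows what a scene-level document could look like before being sent to Solr. The field names, values, core name and URL are illustrative assumptions, not the actual LinkedTV schema.

    # A minimal sketch of a scene-level document prior to indexing.
    # All field names and values are hypothetical; they only illustrate the
    # combination of text fields and per-concept float fields described above.
    import requests

    scene_doc = {
        "id": "video_0042_scene_07",
        "video_id": "video_0042",
        "start": 312.4,                   # scene boundaries in seconds
        "end": 389.1,
        "subtitle": "... subtitle/transcript text of the scene ...",
        "keywords": ["election", "parliament", "interview"],
        "ocr": "text recognised on the scene keyframes",
        # one float field per visual concept (151 TRECVID SIN concepts),
        # holding the detector response for the scene
        "concept_person_f": 0.92,
        "concept_studio_f": 0.71,
        "concept_outdoor_f": 0.08,
    }

    # Documents can be pushed through Solr's JSON update handler, e.g.:
    requests.post("http://localhost:8983/solr/scenes/update?commit=true",
                  json=[scene_doc], timeout=30)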
1.1 Temporal Granularity

Three temporal granularities are evaluated. The first, termed Text-Segment, consists of grouping together sentences (up to 40) from the text sources. We also propose to segment videos into scenes, which consist of semantically correlated adjacent shots. Two strategies are employed to create scene-level temporal segments: visually similar adjacent shots are merged together to create Visual-scenes [10], while Topic-scenes are built by jointly considering the results of this visual scene segmentation and text-based topical cohesion (exploiting text extracted from ASR transcripts or subtitles).
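As a rough illustration of the Visual-scene idea only (the actual segmentation is that of [10]), adjacent shots could be merged whenever their keyframe descriptors are sufficiently similar; the descriptors and the threshold below are assumptions.

    import numpy as np

    def merge_adjacent_shots(keyframe_feats, sim_threshold=0.8):
        """Toy sketch: group consecutive shots into scenes when the cosine
        similarity of their keyframe descriptors exceeds a threshold.
        keyframe_feats: array-like of shape (num_shots, feature_dim)."""
        scenes, current = [], [0]
        for i in range(1, len(keyframe_feats)):
            a, b = np.asarray(keyframe_feats[i - 1]), np.asarray(keyframe_feats[i])
            sim = float(np.dot(a, b) /
                        (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
            if sim >= sim_threshold:
                current.append(i)        # visually similar: same scene
            else:
                scenes.append(current)   # dissimilar: start a new scene
                current = [i]
        scenes.append(current)
        return scenes                    # list of lists of shot indices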
                                                                       are retrieved at the video level and then the relevant video
ments are characterized by both text (subtitle/transcript,
                                                                       segment is locate using the scene-level index. The scene-
keywords, ocr, etc...) and float fields. Each float field corre-
                                                                       level index granularity is either the Visual-Scene (VS ) or
1
    http://www.linkedtv.eu/                                            the Topic-Scene (TS ). Scenes at both granularities are char-
2
    http://lucene.apache.org/solr/                                     acterized by textual information only (either the subtitle
                                                                       (M ) or one of the 3 ASR transcripts ( (U ) LIUM [7], (I )
                                                                       LIMSI [4], (S ) NST/Sheffield [5])).
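The selection step of the Google Image strategy can be summarised by the sketch below; the image retrieval and the concept detector itself are outside its scope, and the number of concepts kept (top_k) is an assumption.

    import numpy as np

    def select_query_concepts(image_concept_scores, concept_names, top_k=3):
        """Sketch of the GoogleImage (GI) strategy: given concept-detector
        scores for the first 100 images returned for the query terms
        (array of shape num_images x 151), keep the concepts with the
        highest average score. top_k is a hypothetical choice."""
        avg = np.asarray(image_concept_scores).mean(axis=0)   # mean per concept
        best = np.argsort(avg)[::-1][:top_k]                  # highest averages
        return [(concept_names[i], float(avg[i])) for i in best]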
2. SEARCH SUB-TASK

2.1 Text-based methods

In this approach, relevant text and video segments are retrieved using Solr with text (TXT) only. Two strategies are compared: one where the search is performed at the text-segment level directly (S), and one where the first 50 videos are retrieved at the video level and the relevant video segments are then located using the scene-level index. The scene-level index granularity is either the Visual-Scene (VS) or the Topic-Scene (TS). Scenes at both granularities are characterized by textual information only, either the subtitle (M) or one of the three ASR transcripts ((U) LIUM [7], (I) LIMSI [4], (S) NST/Sheffield [5]).
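A minimal sketch of the second strategy (video level first, then scene level) using Solr's standard select API over HTTP is given below; the core names, field names and parameter values are placeholders rather than the actual LinkedTV configuration.

    import requests

    SOLR = "http://localhost:8983/solr"   # placeholder Solr base URL

    def two_stage_search(query, rows_videos=50, rows_segments=20):
        """Sketch: retrieve the top videos from the video-level index, then
        search the scene-level index restricted to those videos."""
        videos = requests.get(f"{SOLR}/videos/select",
                              params={"q": query, "rows": rows_videos, "wt": "json"},
                              timeout=30).json()["response"]["docs"]
        video_ids = [d["id"] for d in videos]
        # restrict the scene-level search to the top-ranked videos
        fq = "video_id:(" + " OR ".join(f'"{v}"' for v in video_ids) + ")"
        scenes = requests.get(f"{SOLR}/scenes/select",
                              params={"q": query, "fq": fq,
                                      "rows": rows_segments, "wt": "json"},
                              timeout=30).json()["response"]["docs"]
        return scenes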
2.2 Multimodal Fusion method

Motivated by [8], visual concept scores are fused with the text-based results from Solr to perform re-ranking. For each query, the relevant visual concepts (out of the 151 available) are identified using either the WordNet (WN) or the GoogleImage (GI) strategy. For these multi-modal (MM) runs, only the visual scene (VS) segmentation is evaluated.
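The fusion itself follows [8]; purely as an illustration, a weighted linear combination of the normalised text score and the mean response of the selected concepts could be used to re-rank the retrieved segments (the weight alpha is an assumption).

    def rerank_multimodal(text_results, concept_scores, relevant_concepts, alpha=0.7):
        """Sketch of a late-fusion re-ranking step, not the exact fusion of [8].
        text_results: list of (segment_id, text_score) pairs from Solr.
        concept_scores[seg][c]: visual concept response of segment seg for concept c.
        relevant_concepts: concepts selected by the WN or GI strategy."""
        max_txt = max((s for _, s in text_results), default=1.0) or 1.0
        fused = []
        for seg, txt in text_results:
            vis = 0.0
            if relevant_concepts:
                vis = sum(concept_scores[seg].get(c, 0.0) for c in relevant_concepts)
                vis /= len(relevant_concepts)
            fused.append((seg, alpha * (txt / max_txt) + (1.0 - alpha) * vis))
        return sorted(fused, key=lambda x: x[1], reverse=True)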
3. HYPERLINKING SUB-TASK

Pivotal to the hyperlinking task is the ability to automatically craft an effective query from the video anchor under consideration, in order to search within the annotated set of media. We submitted two alternative approaches: one using the MoreLikeThis (MLT) Solr extension, and the other using Solr's query engine directly. MLT is used in combination with the sentence segments (S), using either text (MLT1) or text and annotations [2] (MLT2). When Solr is used directly, we consider either text only (TXT) or text combined with the visual concept scores of anchors (MM) to formulate queries. Keywords appearing within the query anchor's subtitles compose the textual part of the query. Visual concepts whose scores within the query anchor exceed a 0.7 threshold are identified as relevant to the video anchor and added to the Solr query. Both the visual scene (VS) and topic scene (TS) granularities are evaluated in this approach.
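The query formulation of the direct-Solr runs can be summarised as follows; only the keyword part and the 0.7 concept threshold come from the description above, while the field names and query syntax are hypothetical.

    def build_anchor_query(anchor_keywords, anchor_concept_scores, threshold=0.7):
        """Sketch of query formulation for hyperlinking: keywords extracted from
        the anchor's subtitles form the textual part, and visual concepts whose
        detection score within the anchor exceeds the threshold are appended."""
        text_part = " ".join(anchor_keywords)
        selected = [c for c, s in anchor_concept_scores.items() if s > threshold]
        concept_part = " ".join(f'concepts:"{c}"' for c in selected)
        return f"text:({text_part}) {concept_part}".strip()

    # Example:
    # build_anchor_query(["election", "debate"], {"person": 0.91, "outdoor": 0.12})
    # -> 'text:(election debate) concepts:"person"'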
4. RESULTS

4.1 Search sub-task

Table 1 shows the performance of our search runs. Our best performing approach according to MAP (TXT VS M) relies on the manual transcripts (subtitles) only, segmented according to visual scenes. Looking at the precision scores at ranks 5, 10 and 20, one can notice that the multi-modal approaches using WordNet (MM VS WN M) and Google Images (MM VS GI M) boost the performance of the text-only approaches. There is a clear performance drop whenever ASR transcripts (I, U or S) are employed instead of subtitles (M).

Table 1: Results of the Search sub-task

  Run            MAP      P@5      P@10     P@20
  TXT TS I       0.4664   0.6533   0.6167   0.5317
  TXT TS M       0.4871   0.6733   0.6333   0.545
  TXT TS S       0.4435   0.66     0.6367   0.54
  TXT TS U       0.4205   0.6467   0.6      0.5133
  TXT S I        0.2784   0.6467   0.57     0.4133
  TXT S M        0.3456   0.6333   0.5933   0.48
  TXT S S        0.1672   0.3926   0.3815   0.3019
  TXT S U        0.3144   0.66     0.6233   0.48
  TXT VS I       0.4672   0.66     0.62     0.53
  TXT VS M       0.5172   0.68     0.6733   0.5933
  TXT VS S       0.465    0.6933   0.6367   0.5317
  TXT VS U       0.4208   0.6267   0.6067   0.53
  MM VS WN M     0.5096   0.7      0.6967   0.5833
  MM VS GI M     0.509    0.6667   0.68     0.5933

4.2 Hyperlinking sub-task

Table 2 shows the performance of our hyperlinking runs. Again, the approach based on the subtitles only (TXT VS M) performed best (MAP=0.25), followed by the approach using MoreLikeThis (TXT S MLT1 M). The multi-modal approaches did not produce the expected performance improvement; we believe this is due to the significant reduction of anchor duration compared with last year.

Table 2: Results of the Hyperlinking sub-task

  Run              MAP      P@5      P@10     P@20
  TXT S MLT2 I     0.0502   0.2333   0.1833   0.1117
  TXT S MLT2 M     0.1201   0.3667   0.3267   0.2217
  TXT S MLT2 S     0.0855   0.2067   0.2233   0.1717
  TXT VS M         0.2524   0.504    0.448    0.328
  TXT S MLT1 I     0.0798   0.3      0.2462   0.1635
  TXT S MLT1 M     0.1511   0.4167   0.375    0.2687
  TXT S MLT1 S     0.1118   0.3      0.2857   0.2143
  TXT S MLT1 U     0.1068   0.2692   0.2577   0.2038
  MM VS M          0.1201   0.3      0.2885   0.1923
  MM TS M          0.1048   0.3538   0.2654   0.1692

5. CONCLUSION

The results of LinkedTV's approaches on the 2014 MediaEval S&H task show that it is difficult to improve over text-based approaches when no visual cues are provided. Overall, the performance of our S&H algorithms on this year's dataset has decreased compared to 2013, showing that the changes in the task definition have made the task harder to solve.

6. ACKNOWLEDGMENTS

This work was supported by the European Commission under contract FP7-287911 LinkedTV.

7. REFERENCES

[1] E. Apostolidis and V. Mezaris. Fast shot segmentation combining global and local visual descriptors. In IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), pages 6583–6587, Italy, 2014.
[2] M. Dojchinovski and T. Kliegr. Entityclassifier.eu: Real-time classification of entities in text with Wikipedia. In H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, editors, Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science, pages 654–658. Springer, 2013.
[3] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking task at MediaEval 2014. In MediaEval 2014 Workshop, Spain.
[4] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI broadcast news transcription system. Speech Communication, 37(1):89–108, 2002.
[5] T. Hain, A. El Hannani, S. N. Wrigley, and V. Wan. Automatic speech recognition for scientific purposes - WebASR. In Interspeech, pages 504–507, Australia, 2008.
[6] P. Over et al. TRECVID 2012 – An overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2012. NIST, USA, 2012.
[7] A. Rousseau, P. Deléglise, and Y. Estève. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In LREC 2014, Iceland.
[8] B. Safadi, M. Sahuguet, and B. Huet. When textual and visual information join forces for multimedia retrieval. In ACM ICMR 2014, Glasgow, Scotland.
[9] P. Sidiropoulos, V. Mezaris, and I. Kompatsiaris. Enhancing video concept detection with the use of tomographs. In IEEE ICIP 2013, Australia.
[10] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso. Temporal video segmentation to scenes using high-level audiovisual features. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177, Aug. 2011.
[11] D. Stein, S. Eickeler, R. Bardeli, E. Apostolidis, V. Mezaris, and M. Müller. Think before you link – Meeting content constraints when linking television to the Web. In Proc. NEM Summit, Nantes, France, Oct. 2013.
[12] S. Tschöpel and D. Schneider. A lightweight keyword and tag-cloud retrieval algorithm for automatic speech recognition transcripts. In Interspeech, Japan, 2010.