LinkedTV at MediaEval 2013 Search and Hyperlinking Task

M. Sahuguet1, B. Huet1, B. Červenková2, E. Apostolidis3, V. Mezaris3, D. Stein4, S. Eickeler4, J.L. Redondo Garcia1, R. Troncy1, and L. Pikora2

1 Eurecom, Sophia Antipolis, France. [sahuguet,huet,redondo,troncy]@eurecom.fr
2 University of Economics, Prague, Czech Republic. [barbora.cervenkova,lukas.pikora]@vse.cz
3 Information Technologies Institute, Thessaloniki, Greece. [apostolid,bmezaris]@iti.gr
4 Fraunhofer IAIS, Sankt Augustin, Germany. [daniel.stein,stefan.eickeler]@iais.fraunhofer.de

ABSTRACT
This paper presents the results of LinkedTV's first participation in the Search and Hyperlinking task at the MediaEval 2013 benchmark. We used textual information (transcripts, subtitles and metadata) and tested its combination with automatically detected visual concepts. We submitted various runs in order to compare these approaches and to measure the improvement obtained by adding visual information.

1. INTRODUCTION
This paper describes the framework used by the LinkedTV team to tackle the problem of Search and Hyperlinking within a video collection [2]. The applied techniques originate from the LinkedTV project (http://www.linkedtv.eu/), which aims at integrating the TV and Internet experiences by enabling the user to access additional information and media resources aggregated from diverse sources, thanks to automatic media annotation.

2. PRE-PROCESSING STEP
Concept detection was performed on the key-frames of each video, following the approach in [6], while the algorithm for Optical Character Recognition (OCR) described in [8] was used for text localization. Moreover, for each video we extracted keywords from the provided subtitles, based on the algorithm presented in [9]. Finally, we grouped the predefined video shots into bigger segments (scenes), based on the visual similarity and the temporal consistency among them, using the method introduced in [7].

3. OUR FRAMEWORK
3.1 Lucene indexing
We indexed all available data in a Lucene index at different granularities: video level, scene level, shot level, and segments created with a sliding window algorithm [3]. Documents were represented by both textual fields (for text search) and floating point fields (for the visual concepts).

3.2 From visual cues to detected concepts
Text search is straightforward with Lucene, using the default text search based on TF-IDF values. In order to incorporate visual information into the search, we mapped keywords extracted from the visual cues query (using Alchemy API, http://www.alchemyapi.com/) to visual concepts, using a semantic word distance based on WordNet synsets [5]. When visual concepts were detected in the query, we enriched the textual query with range queries on the values of the corresponding visual concepts.
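As a rough illustration of the indexing scheme and the query enrichment described above, the sketch below indexes one segment with a text field plus one floating point field per detected visual concept, and then builds a query that combines the free-text part with range constraints on the matched concepts. It assumes a Lucene 4.x API; the class name SegmentIndexer, the field names (id, granularity, text, concept_*) and the 0.5 score threshold are illustrative choices, not the exact configuration of our runs.

```java
import java.io.File;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SegmentIndexer {

    private static final Version LUCENE = Version.LUCENE_46;
    private final Analyzer analyzer = new StandardAnalyzer(LUCENE);

    public IndexWriter openWriter(String indexDir) throws Exception {
        return new IndexWriter(FSDirectory.open(new File(indexDir)),
                               new IndexWriterConfig(LUCENE, analyzer));
    }

    /** Index one segment (video, scene, shot or sliding window) as a Lucene document. */
    public void addSegment(IndexWriter writer, String id, String granularity,
                           String text, Map<String, Float> conceptScores) throws Exception {
        Document doc = new Document();
        doc.add(new StringField("id", id, Field.Store.YES));
        doc.add(new StringField("granularity", granularity, Field.Store.YES)); // video / scene / shot / window
        doc.add(new TextField("text", text, Field.Store.YES));                 // subtitles, transcript, metadata
        for (Map.Entry<String, Float> c : conceptScores.entrySet()) {
            // one float field per visual concept, holding its detection score
            doc.add(new FloatField("concept_" + c.getKey(), c.getValue(), Field.Store.YES));
        }
        writer.addDocument(doc);
    }

    /** Build the enriched query: free text plus range constraints on the matched visual concepts. */
    public Query buildQuery(String textQuery, Iterable<String> matchedConcepts) throws Exception {
        BooleanQuery query = new BooleanQuery();
        query.add(new QueryParser(LUCENE, "text", analyzer).parse(textQuery), BooleanClause.Occur.MUST);
        for (String concept : matchedConcepts) {
            // reward segments where the mapped concept was detected with sufficient confidence
            query.add(NumericRangeQuery.newFloatRange("concept_" + concept, 0.5f, 1.0f, true, true),
                      BooleanClause.Occur.SHOULD);
        }
        return query;
    }
}
```

With this construction, the textual clause is mandatory while the concept range clauses act as optional boosts, so segments whose stored concept scores fall within the requested range are ranked higher without excluding purely textual matches.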
3.3 Search task
We combined the textual and visual parts of a query to perform the search. Two strategies were adopted: we either searched the segments indexed in the Lucene engine directly, or created segments on the fly by merging video segments based on their scores.

Performing a text query on the video-level index often returned the relevant video at the top of the list. Hence, some runs first restricted the pool of videos to be searched to a small number, and then performed additional queries for smaller segments inside this pool.

We submitted 9 runs in total:
• scenes-C: Scene search using textual and visual cues.
• scenes-noC: Same as the previous run, using textual cues only (no visual cues), for comparison purposes.
• part-sc-C: Partial scene search from a shot boundary, using textual and visual cues, in three steps: filtering the list of videos; querying for shots inside each video; ordering them by score. As a shot is too small a unit to be returned to a viewer, we extended the segment to the end of the scene that includes this shot.
• part-sc-noC: Same as the previous run, using textual cues only.
• cl10-C: Temporal clustering of shots within a video, using textual and visual cues, in the following manner: filtering the set of videos to search; computing scores for every shot in the video; clustering together shots that are less than 10 seconds apart (their scores were added to form the final score). A sketch of this grouping step is given after the run list.
• cl10-noC: Same as the previous run, using text search only.
• scenes-S, scenes-U, scenes-I: Scene search using only textual cues from the subtitles or a transcript, without metadata.
• SW-60-I, SW-60-S: Search over segments created by the sliding window algorithm, with a window size of 60, for the LIMSI/Vocapia transcripts [4] and for the subtitles, respectively.
• SW-40-U: Same as above for the LIUM transcripts, with a window size of 40.
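The shot grouping used in the cl10 runs can be illustrated as follows: retrieved shots of a video are sorted by start time, consecutive shots less than 10 seconds apart are merged into one segment, and the shot scores are added to score the segment. This is a minimal sketch only; the Shot and Segment containers, and the interpretation of the 10-second threshold as the gap between consecutive shots, are simplifications made for this example rather than our actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Minimal sketch of the cl10 grouping: merge retrieved shots that are close in time. */
public class ShotClustering {

    public static class Shot {
        public final double start, end; // shot boundaries in seconds
        public final double score;      // retrieval score of the shot
        public Shot(double start, double end, double score) {
            this.start = start; this.end = end; this.score = score;
        }
    }

    public static class Segment {
        public double start, end;
        public double score;            // sum of the scores of the merged shots
        Segment(Shot s) { start = s.start; end = s.end; score = s.score; }
    }

    /** Cluster the shots of one video: shots less than maxGap seconds apart end up in the same segment. */
    public static List<Segment> cluster(List<Shot> shots, double maxGap) {
        List<Shot> sorted = new ArrayList<>(shots);
        sorted.sort(Comparator.comparingDouble(s -> s.start));

        List<Segment> segments = new ArrayList<>();
        Segment current = null;
        for (Shot shot : sorted) {
            if (current != null && shot.start - current.end < maxGap) {
                current.end = Math.max(current.end, shot.end);  // extend the current segment
                current.score += shot.score;                    // scores are added
            } else {
                current = new Segment(shot);                    // start a new segment
                segments.add(current);
            }
        }
        // the resulting segments are then ranked by their accumulated score
        segments.sort((a, b) -> Double.compare(b.score, a.score));
        return segments;
    }
}
```

The same grouping serves both cl10-C and cl10-noC; only the per-shot scores differ, depending on whether visual cues were included in the query.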
3.4 Hyperlinking task
A first approach consisted in reusing the search component with the scene approach and the shot clustering approach. A query was crafted from the anchor: the text query was built by extracting keywords from the subtitles aligned with the start and end times of the anchor, and visual concept scores were extracted from the keyframes of the shots contained in the anchor. If the anchor spanned more than one shot, we took for each concept the highest score over all shots.

A second approach made use of the MoreLikeThis Solr component (MLT) combined with Entityclassifier.eu annotations [1]. We created a temporary document from the query as the root for searching similar documents, and performed the search over segments from the LIMSI transcripts created using sliding windows and enriched with synonyms.

4. RESULTS
4.1 Search task
The results of the search task are listed in Table 1. We first notice that, given the same conditions, subtitles perform significantly better than any of the transcripts, which is an expected outcome. It is also interesting to note that using the visual concepts in the query slightly increases the results for all measures (e.g., cl10-C vs. cl10-noC).

Table 1: Results of the Search task
Run          MRR     mGAP    MASP
scenes-C     0.3095  0.1770  0.1951
scenes-noC   0.3091  0.1767  0.1947
scenes-S     0.3152  0.1635  0.2021
scenes-I     0.2613  0.1444  0.1582
scenes-U     0.2458  0.1344  0.1528
part-sc-C    0.2284  0.1241  0.1024
part-sc-noC  0.2281  0.1240  0.1021
cl10-C       0.2929  0.1525  0.1814
cl10-noC     0.2849  0.1479  0.1713
SW-60-S      0.2833  0.1925  0.2027
SW-60-I      0.1965  0.1206  0.1204
SW-40-U      0.2368  0.1342  0.1501

Overall, the best approaches are those using scenes and sliding windows. Scene-based approaches retrieve a higher number of correct relevant segments within a time window of 60 seconds (higher MRR), but they are not the most precise in terms of start and end times, compared to the sliding window approach (as suggested by the mGAP and MASP scores).

4.2 Hyperlinking task
The results are listed in Table 2. For both the LA and LC conditions, runs using scenes outperform the other runs on all metrics. The MoreLikeThis/Entityclassifier.eu approach comes second. As expected, using the context increases the precision when hyperlinking video segments. It is also notable that the precision at rank n decreases when n increases.

Table 2: Results of the Hyperlinking task
Run        MAP     P-5     P-10    P-20
LA cl10    0.0577  0.4467  0.3200  0.2067
LA MLT     0.1201  0.4200  0.4200  0.3217
LA scenes  0.1770  0.6867  0.5867  0.4167
LC cl10    0.0823  0.5733  0.4833  0.2767
LC MLT     0.1820  0.5667  0.5667  0.4300
LC scenes  0.2523  0.8133  0.7300  0.5283

5. CONCLUSION
This paper presented our framework and results for the MediaEval Search and Hyperlinking task. From our runs, it is clear that scene segmentation is the approach with the best performance. This approach should therefore be studied in more depth; a potential improvement would be to refine the segmentation using semantic or speaker information. We also see that the task benefits from the use of the visual information present in the video. Hence, these two axes should be the next steps to study for a future challenge.

6. ACKNOWLEDGMENTS
This work was supported by the European Commission under contracts FP7-287911 LinkedTV and FP7-318101 MediaMixer.

7. REFERENCES
[1] M. Dojchinovski and T. Kliegr. Entityclassifier.eu: Real-Time Classification of Entities in Text with Wikipedia. In H. Blockeel, K. Kersting, S. Nijssen, and F. Železný, editors, Machine Learning and Knowledge Discovery in Databases, volume 8190 of Lecture Notes in Computer Science, pages 654–658. Springer Berlin Heidelberg, 2013.
[2] M. Eskevich, G. J. F. Jones, S. Chen, R. Aly, and R. Ordelman. The Search and Hyperlinking Task at MediaEval 2013. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.
[3] M. Eskevich, G. Jones, C. Wartena, M. Larson, R. Aly, T. Verschoor, and R. Ordelman. Comparing retrieval effectiveness of alternative content segmentation methods for Internet video search. In Content-Based Multimedia Indexing (CBMI), 2012 10th International Workshop on, pages 1–6, 2012.
[4] L. Lamel and J.-L. Gauvain. Speech processing for audio indexing. In Advances in Natural Language Processing (LNCS 5221), pages 4–15. Springer, 2008.
[5] D. Lin. An Information-Theoretic Definition of Similarity. In Proceedings of the Fifteenth International Conference on Machine Learning, ICML '98, pages 296–304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.
[6] P. Sidiropoulos, V. Mezaris, and I. Kompatsiaris. Enhancing video concept detection with the use of tomographs. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2013.
[7] P. Sidiropoulos, V. Mezaris, I. Kompatsiaris, H. Meinedo, M. Bugalho, and I. Trancoso. Temporal Video Segmentation to Scenes Using High-Level Audiovisual Features. IEEE Transactions on Circuits and Systems for Video Technology, 21(8):1163–1177, Aug. 2011.
[8] D. Stein, S. Eickeler, R. Bardeli, E. Apostolidis, V. Mezaris, and M. Müller. Think Before You Link – Meeting Content Constraints when Linking Television to the Web. In Proc. NEM Summit, Nantes, France, Oct. 2013. To appear.
[9] S. Tschöpel and D. Schneider. A lightweight keyword and tag-cloud retrieval algorithm for automatic speech recognition transcripts. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH), 2010.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain.