            Multimedia Content Processing and Retrieval
                   in the REVEAL THIS setting
             Stelios Piperidis1, Harris Papageorgiou1, Katerina Pastra1, Thomas Netousek2, Eric Gaussier3, Tinne
                                  Tuytelaars4, Fabio Crestani5, Francis Bodson6, Chris Mellor7


1 Institute for Language and Speech Processing, Athens, Greece; 2 SAIL LABS Technology AG; 3 Xerox - The Document Company S.A.S; 4 Katholieke Universiteit Leuven R&D; 5 University of Strathclyde; 6 BeTV SA; 7 TVEyes UK Ltd

The REVEAL THIS project (www.reveal-this.org) is a thirty-month STREP project funded by the FP6-IST programme of the European Commission, contract No. FP6-IST-511689. It is designed and implemented by the REVEAL THIS consortium, comprising the Institute for Language and Speech Processing (Coordinator), SAIL LABS Technology AG, Xerox - The Document Company S.A.S, Katholieke Universiteit Leuven R&D, University of Strathclyde, BeTV SA and TVEyes UK Ltd.

   Abstract— The explosion of multimedia digital content and the development of technologies that go beyond traditional broadcast and TV have rendered access to such content important for all end-users of these technologies. REVEAL THIS develops content processing technology able to semantically index, categorise and cross-link multiplatform, multimedia and multilingual digital content, providing the system user with search, retrieval, summarisation and translation functionalities.

   Index Terms— audio-image-text analysis, cross-media linking and indexing, cross-media categorisation, cross-media summarisation, cross-lingual translation

                          I. INTRODUCTION

The development of methods and tools for content-based organization and filtering of the large amount of multimedia information that reaches the user is a key issue for its effective consumption. Despite recent technological progress in the new media and the Internet, the key issue remains "how digital technology could add value to information channels and systems" [1].
   REVEAL THIS aims at answering this question by
tackling the following scientific and technological challenges:
• enrichment of multilingual multimedia content with
  semantic information like topics, speakers, actors, facts,
  categories
• establishment of semantic links between pieces of
  information presented in different media and languages
• development of cross-media categorisation and
  summarisation engines
• deployment of cross-language information retrieval and
  machine translation to allow users to search for and retrieve
  information according to their language preferences.
   Web, TV and/or radio content is fed into the REVEAL THIS prototype; it is analysed, indexed, categorised, summarised and stored in an archive. This content can be searched and/or pushed to a user according to his/her interests. Both novice and advanced computer users are targeted; they can access the system through the web and perform simple or advanced searches, respectively. Furthermore, mobile phone access to the system is possible through a GPRS or wireless LAN connection to the system's mobile phone server. In this case the system's role is more proactive, in that it pushes information to the user according to the user's profile. EU politics, news and travel data are handled by the system in English and Greek.

                  II. THE REVEAL THIS SYSTEM

   As depicted in Figure 1, the REVEAL THIS system comprises a number of single-media and multimedia technologies that can be grouped, for presentation purposes, into the following subsystems: (i) Content Analysis & Indexing (CAIS), (ii) Cross-media Categorisation (CCS), (iii) Cross-media Summarisation (CSS), (iv) Cross-lingual Translation (CLTS), and (v) Cross-media Content Access and Retrieval.

                  Figure 1: REVEAL THIS system workflow

  A. Cross-media Content Analysis
The CAIS subsystem consists of technologies and components for medium-specific analysis:
• Speech processing (SPC) – involving speech recognition, speaker identification and speaker turn detection
• Image analysis and categorisation (IAC) – involving shot and keyframe extraction, low-level visual feature extraction and image categorisation [2]
• Face analysis (FDIC) – involving face recognition and identification [3]
• Text processing (TPC) – involving named entity, term and fact extraction, and topic detection
• Cross-media Indexing (CMIC) – catering for the establishment of links between all above-mentioned metadata for a multimedia file, using a modified TF-IDF and a Dempster-Shafer based approach [4]; a minimal evidence-combination sketch follows this list
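By way of illustration, the following is a minimal sketch of how Dempster's rule of combination can fuse keyword evidence coming from two media, say the speech transcript and the image categoriser. The two-hypothesis frame, the function names and the mass values are assumptions made for the example; the project's actual CMIC weighting scheme is the one described in [4].

# Hypothetical sketch: Dempster-Shafer fusion of per-keyword evidence
# from two media, over the frame {'REL', 'NOT'}, with 'EITHER'
# standing for the full-ignorance set {REL, NOT}.

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions."""
    sets = {'REL': frozenset({'REL'}),
            'NOT': frozenset({'NOT'}),
            'EITHER': frozenset({'REL', 'NOT'})}
    names = {v: k for k, v in sets.items()}
    combined = {'REL': 0.0, 'NOT': 0.0, 'EITHER': 0.0}
    conflict = 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = sets[a] & sets[b]
            if inter:
                combined[names[inter]] += wa * wb
            else:
                conflict += wa * wb      # mass assigned to contradictory pairs
    # normalise by the non-conflicting mass
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Speech gives strong evidence for a keyword, the image categoriser weaker
# evidence; the fused belief in relevance is higher than either alone.
m_speech = {'REL': 0.7, 'NOT': 0.1, 'EITHER': 0.2}
m_image  = {'REL': 0.4, 'NOT': 0.2, 'EITHER': 0.4}
print(combine(m_speech, m_image))        # REL mass ~ 0.78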
The metadata/indices produced by the above components are aligned, synchronised, linked to the corresponding points of the source material (text, audio and video) and encoded in MPEG-7. Information suggested by audio processing (speaker turns) and topic detection is taken into account to segment the audiovisual or audio files into segments, or what one could call "stories", i.e. thematic sections of the document. Categorisation, summarisation and translation of multimedia documents themselves make use of part of these metadata.
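As a rough illustration of this segmentation step, the sketch below keeps a speaker-turn boundary only when a detected topic shift falls close to it, so that each surviving boundary opens a new "story". The tolerance value and the time representation are assumptions for the example, not project parameters.

# Hypothetical sketch: cut audiovisual files into "stories" at points
# where a speaker turn and a topic shift (nearly) coincide.

def story_boundaries(speaker_turns, topic_shifts, tol=2.0):
    """Retain speaker-turn times (seconds) that lie within `tol`
    seconds of a topic shift; each retained time starts a story."""
    return [t for t in speaker_turns
            if any(abs(t - s) <= tol for s in topic_shifts)]

turns  = [12.4, 95.0, 180.7, 260.3]     # from speaker turn detection (SPC)
shifts = [13.1, 181.5]                  # from topic detection (TPC)
print(story_boundaries(turns, shifts))  # -> [12.4, 180.7]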
                                                                  providing translations of the textual part of the summaries
 B. Cross-media Categorisation
   The categorisation subsystem considers documents containing not only text or images but a combination of different types of media (text, image, speech, video). A multiple-view fusion method is adopted, which builds 'on top' of two single-media categorisers, a textual and an image categoriser, without the need to re-train them. Data annotated manually for both textual and image categories is used for training the cross-media categoriser. On that set, dependencies between the single-media category systems are exploited in order to refine the categorisation decisions made [5].
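A minimal late-fusion sketch of this idea: the per-category scores of the two frozen single-media categorisers are concatenated into a feature vector, on which a small meta-classifier is trained using the doubly-annotated data. The shapes, random stand-in data and logistic-regression choice are illustrative assumptions; the actual method, which models dependencies between the category systems, is the one in [5].

# Hypothetical sketch: train a meta-classifier on top of two frozen
# single-media categorisers (late fusion / stacking).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse(text_scores, image_scores):
    """Concatenate the outputs of the two single-media categorisers."""
    return np.hstack([text_scores, image_scores])

rng = np.random.default_rng(0)
text_scores  = rng.random((200, 5))   # stand-in for text categoriser output
image_scores = rng.random((200, 8))   # stand-in for image categoriser output
y = rng.integers(0, 4, size=200)      # cross-media category labels

meta = LogisticRegression(max_iter=1000).fit(fuse(text_scores, image_scores), y)
print(meta.predict(fuse(text_scores[:3], image_scores[:3])))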
                                                                  final system refinements. The technology developed is
  C. Cross-media Summarisation
The cross-media summarisation subsystem (CSS) determines and presents the most salient parts of a document according to the users' profiles and interests by fusing video, audio and textual metadata. It comprises three major components: the text-based summarisation (TS) component, the visual-based summarisation (VS) component, and the cross-media summarisation component, which fuses the two analyses and creates a self-contained object. Building on the MEAD development platform [6], the TS component extracts the top-ranked sentences of a story: for each sentence, a salience score is computed as a weighted sum of summary-worthy features.
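A minimal sketch of such scoring follows, with a MEAD-like feature set (centroid similarity, sentence position, overlap with the lead sentence); the feature choice and weights are illustrative defaults, not the project's tuned configuration.

# Hypothetical sketch: MEAD-style salience as a weighted sum of
# summary-worthy features, then selection of the top-k sentences.

def salience(centroid_sim, position, lead_overlap,
             w_c=1.0, w_p=1.0, w_l=1.0):
    """Weighted sum of features for one sentence:
    centroid_sim - similarity to the story's centroid vector,
    position     - e.g. 1/rank of the sentence within the story,
    lead_overlap - word overlap with the story's first sentence."""
    return w_c * centroid_sim + w_p * position + w_l * lead_overlap

def summarise(sentences, features, k=3):
    """Return the k top-ranked sentences, restored to story order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: salience(*features[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]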
  The VS component comprises the scene segmentation, scene clustering and scene labelling modules [6]. The scene segmenter segments the video sequence into scenes: scene boundaries are detected as local minima of the visual coherence function, with each scene corresponding, ideally, to a story of the video (a minimal sketch of this boundary test follows). Scene clustering caters for simple applications that need a few indicative images. Keyframes of the scene are clustered into larger parts, from which a prototypical image is chosen; clustering is repeated iteratively to acquire a hierarchical cluster tree, and the prototypes of these clusters can be seen as representative images of the scene.
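In the sketch below, each candidate cut point carries a visual coherence value (for instance, the similarity of the keyframes on either side of it) and scene boundaries are taken as the strict local minima of that curve; the toy values are invented for illustration.

# Hypothetical sketch: scene boundaries as strict local minima of the
# visual coherence curve over candidate cut points.

def scene_boundaries(coherence):
    """Indices at which the coherence curve has a strict local minimum."""
    return [i for i in range(1, len(coherence) - 1)
            if coherence[i] < coherence[i - 1]
            and coherence[i] < coherence[i + 1]]

curve = [0.9, 0.8, 0.3, 0.7, 0.85, 0.4, 0.9]   # toy coherence per cut point
print(scene_boundaries(curve))                  # -> [2, 5]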
Going a step further, scene labelling is invoked for creating structured views of a file; currently, it is news programmes that can be browsed in such a way, allowing the user to watch all "anchor", "interview" and "reportage" segments. The module labels all shots of a file accordingly, by exploiting a belief propagation network. Finally, the CSS brings all these pieces of information together, providing visualisation interfaces (SMIL/HTML+TIME) that enable the user to preview multimedia objects effectively before downloading them.

  D. Cross-lingual Translation
The CLTS subsystem allows users to query documents written in different languages, to categorise content expressed in different languages and to preview language-specific summaries. A bilingual lexicon extraction module is used to generate lexical equivalences for query translation purposes (a toy example follows), but also to replace keywords in a target language in case a document is linguistically not well formed (e.g. output from a speech recogniser) and thus not effectively translated. Last, a statistical machine translation module is responsible for providing translations of the textual part of the summaries produced by the cross-media summarisation subsystem.
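For query translation, an extracted lexicon can be applied as in the toy sketch below, where each source-language term is replaced by its best-scoring target-language equivalents and out-of-lexicon terms pass through unchanged. The lexicon entries, scores and the term-by-term strategy are assumptions made for illustration, not the project's extracted resources.

# Hypothetical sketch: Greek-to-English query translation with a
# score-ranked bilingual lexicon; unknown terms are kept as-is.

LEXICON = {                      # source term -> [(target term, score), ...]
    "εκλογές": [("elections", 0.9), ("polls", 0.4)],
    "ταξίδι":  [("travel", 0.8), ("trip", 0.7)],
}

def translate_query(terms, top_n=1):
    """Replace each source term by its top-ranked lexicon equivalents."""
    out = []
    for t in terms:
        entries = sorted(LEXICON.get(t, []), key=lambda e: e[1], reverse=True)
        if entries:
            out.extend(word for word, _ in entries[:top_n])
        else:
            out.append(t)                 # out-of-lexicon term passes through
    return out

print(translate_query(["εκλογές", "Βρυξέλλες"]))  # -> ['elections', 'Βρυξέλλες']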
  E. Usability Evaluation
Apart from the technical evaluation of the system components (cf. [2], [3], [5], [7]), the integrated REVEAL THIS prototype is currently being evaluated by prospective users following a task-based approach. A pool of about 30 users for each application (pull and push) has been created. Based on typical search sessions of these users, appropriate search tasks have been created for the users to undertake using the REVEAL THIS prototype. Feedback from the evaluation will guide the final system refinements. The technology developed is envisaged to contribute to a content management platform that can be used by content providers, to add value to their content, and directly by end users, for accessing multimedia information.
                          REFERENCES
[1] K. Pastra and S. Piperidis, "Video Search: New Challenges in the Pervasive Digital Video Era", Journal of Virtual Reality and Broadcasting, in press.
[2] F. Perronnin, C. Dance, G. Csurka and M. Bressan, "Adapted Vocabularies for Generic Visual Categorization", in Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 2006.
[3] M. De Smet, R. Fransens and L. Van Gool, "A generalised EM approach for 3D model based face recognition under occlusions", in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), New York, USA, 2006.
[4] M. Yakici and F. Crestani, "Cross-media Indexing in the Reveal This prototype", in Proceedings of the LREC workshop on "Crossing media for improved information access", Genoa, Italy, 2006.
[5] J. Renders, E. Gaussier, C. Goutte, F. Pacull and G. Csurka, "Categorization in multiple category systems", in Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, USA, 2006.
[6] B. Georgantopoulos, T. Goedeme, S. Lounis, H. Papageorgiou, T. Tuytelaars and L. Van Gool, "Cross-media summarization in a retrieval setting", in Proceedings of the LREC 2006 workshop on "Crossing media for improved information access", Genoa, Italy, 2006.
[7] M. Simard, N. Cancedda, B. Cavestro, M. Dymetman, E. Gaussier, C. Goutte, K. Yamada, P. Langlais and A. Mauser, "Translating with Non-contiguous Phrases", in Proceedings of HLT/EMNLP, Vancouver, Canada, 2005.