=Paper=
{{Paper
|id=Vol-233/paper-29
|storemode=property
|title=Multimedia Content Processing and Retrieval in the REVEAL THIS setting
|pdfUrl=https://ceur-ws.org/Vol-233/p59.pdf
|volume=Vol-233
|dblpUrl=https://dblp.org/rec/conf/samt/PiperidisPPNGTCBM06
}}
==Multimedia Content Processing and Retrieval in the REVEAL THIS setting==
Stelios Piperidis1, Harris Papageorgiou1, Katerina Pastra1, Thomas Netousek2, Eric Gaussier3, Tinne Tuytelaars4, Fabio Crestani5, Francis Bodson6, Chris Mellor7
Abstract—The explosion of multimedia digital content and the development of technologies that go beyond traditional broadcast and TV have rendered access to such content important for all end-users of these technologies. REVEAL THIS develops content processing technology able to semantically index, categorise and cross-link multiplatform, multimedia and multilingual digital content, providing the system user with search, retrieval, summarisation and translation functionalities.

Index Terms—audio-image-text analysis, cross-media linking and indexing, cross-media categorisation, cross-media summarisation, cross-lingual translation

I. INTRODUCTION

The development of methods and tools for content-based organization and filtering of the large amount of multimedia information that reaches the user is a key issue for its effective consumption. Despite recent technological progress in the new media and the Internet, the key issue remains “how digital technology could add value to information channels and systems” [1].
REVEAL THIS aims at answering this question by
tackling the following scientific and technological challenges:
• enrichment of multilingual multimedia content with semantic information like topics, speakers, actors, facts and categories
• establishment of semantic links between pieces of information presented in different media and languages
• development of cross-media categorization and summarization engines
• deployment of cross-language information retrieval and machine translation to allow users to search for and retrieve information according to their language preferences.
The REVEAL THIS project (www.reveal-this.org) is a thirty-month STREP project funded by the FP6-IST programme of the European Commission, contract No FP6-IST-511689. It is designed and implemented by the REVEAL THIS consortium, comprising the Institute for Language and Speech Processing (Coordinator), SAIL LABS Technology AG, Xerox-The Document Company S.A.S, Katholieke Universiteit Leuven R&D, University of Strathclyde, BeTV SA and TVEyes UK Ltd.

1 Institute for Language and Speech Processing, Athens, Greece; 2 SAIL LABS Technology AG; 3 Xerox - The Document Company S.A.S; 4 Katholieke Universiteit Leuven R&D; 5 University of Strathclyde; 6 BeTV SA; 7 TVEyes UK Ltd

II. THE REVEAL THIS SYSTEM

Web, TV and/or Radio content is fed into the REVEAL THIS prototype, where it is analysed, indexed, categorized, summarized and stored in an archive. This content can be searched and/or pushed to a user according to his/her interests. Both novice and advanced computer users are targeted; they can access the system through the web and perform simple or advanced searches respectively. Furthermore, mobile phone access to the system is possible through a GPRS or Wireless LAN connection to the system’s mobile phone server. In this case, the system’s role is more proactive, in that it pushes information to the user according to the user’s profile. EU politics, news and travel data are handled by the system in English and Greek.

As depicted in Figure 1, the REVEAL THIS system comprises a number of single-media and multimedia technologies that can be grouped, for presentation purposes, into the following subsystems: (i) Content Analysis & Indexing (CAIS), (ii) Cross-media Categorisation (CCS), (iii) Cross-media Summarisation (CSS), (iv) Cross-lingual Translation (CLTS), and (v) Cross-media Content Access and Retrieval.

[Figure 1: REVEAL THIS system workflow]

A. Cross-media Content Analysis

The CAIS subsystem consists of technologies and components for medium-specific analysis:
• Speech processing (SPC) – involving speech recognition, speaker identification and speaker turn detection
• Image analysis and categorization (IAC) – involving shot and keyframe extraction, low-level visual feature extraction, and image categorisation [2]
• Face analysis (FDIC) – involving face recognition & identification [3]
• Text processing (TPC) – involving named entity, term & fact extraction, and topic detection
• Cross-media Indexing (CMIC) – catering for the establishment of links between all above-mentioned metadata for a multimedia file, using a modified TF-IDF and a Dempster-Shafer based approach [4]; a sketch of this kind of evidence combination is given after this list
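The modified TF-IDF and the exact Dempster-Shafer formulation are given in [4], not in this overview. The sketch below only illustrates the general idea of Dempster's rule of combination applied to index hypotheses coming from two media streams; the mass assignments, topic labels and function names are hypothetical, not part of the REVEAL THIS codebase.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset of hypotheses -> mass)
    with Dempster's rule, renormalising away the conflicting mass."""
    combined = {}
    conflict = 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb  # mass falling on contradictory evidence
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {h: m / (1.0 - conflict) for h, m in combined.items()}

# Hypothetical example: speech-transcript vs. image-categoriser evidence
# about which topic should index a video segment.
speech = {frozenset({"politics"}): 0.6,
          frozenset({"politics", "travel"}): 0.4}   # partial ignorance
image  = {frozenset({"politics"}): 0.5,
          frozenset({"travel"}): 0.3,
          frozenset({"politics", "travel"}): 0.2}

fused = dempster_combine(speech, image)
for hypothesis, mass in sorted(fused.items(), key=lambda kv: -kv[1]):
    print(set(hypothesis), round(mass, 3))
```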
The metadata/indices produced by the above components are aligned, synchronized, linked to the corresponding points of the source material (text, audio and video) and encoded in MPEG-7. Information suggested by audio processing (speaker turns) and topic detection is taken into account to segment the audiovisual or audio files into segments, or what one could call “stories”, i.e. thematic sections of the document. Categorisation, summarisation and translation of multimedia documents themselves make use of part of these metadata.
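The paper does not specify how the speaker-turn and topic-detection cues are combined into story boundaries. A minimal sketch of one plausible reading, in which topic-shift timestamps are snapped to the nearest speaker-turn boundary, follows; the timestamps, tolerance value and function names are illustrative assumptions.

```python
def story_boundaries(speaker_turns, topic_shifts, tolerance=5.0):
    """Place a story boundary at each detected topic shift, snapped to the
    nearest speaker-turn boundary within `tolerance` seconds: a topic change
    coinciding with a new speaker is a strong cue for a new "story"."""
    boundaries = []
    for shift in topic_shifts:
        nearest = min(speaker_turns, key=lambda t: abs(t - shift))
        boundaries.append(nearest if abs(nearest - shift) <= tolerance else shift)
    return sorted(set(boundaries))

# Hypothetical timestamps (seconds) from the speech and topic-detection components
turns  = [0.0, 42.3, 97.8, 160.5, 240.1]
topics = [45.0, 158.0]
print(story_boundaries(turns, topics))   # -> [42.3, 160.5]
```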
B. Cross-media categorisation

The categorization subsystem considers documents containing not only text or images but a combination of different types of media (text, image, speech, video). A multiple-view fusion method is adopted, which builds “on top” of two single-media categorizers, a textual and an image categorizer, without the need to re-train them. Data annotated manually for both textual and image categories is used for training the cross-media categorizer; within that set, dependencies between the single-media category systems are exploited in order to refine the categorization decisions made [5].
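The actual fusion method is described in [5]. As a rough illustration of a late fusion that needs no retraining of the single-media categorizers, the sketch below combines their per-category probabilities with a weighted geometric mean; the categories, scores and the weight `alpha` are hypothetical.

```python
import numpy as np

def fuse_scores(p_text, p_image, alpha=0.6):
    """Late fusion of two single-media categorizers: a weighted geometric
    mean of their per-category probabilities, renormalised. `alpha` weights
    the textual view and would be tuned on held-out annotated data."""
    fused = (p_text ** alpha) * (p_image ** (1.0 - alpha))
    return fused / fused.sum()

categories = ["EU politics", "news", "travel"]
p_text  = np.array([0.70, 0.20, 0.10])   # hypothetical textual categorizer output
p_image = np.array([0.30, 0.25, 0.45])   # hypothetical image categorizer output
fused = fuse_scores(p_text, p_image)
print(categories[int(fused.argmax())])   # -> "EU politics"
```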
C. Cross-media summarisation

The cross-media summarisation subsystem (CSS) determines and presents the most salient parts according to the users’ profiles and interests by fusing video, audio and textual metadata. It comprises three major components: the textual-based summarization (TS), the visual-based summarization (VS), and the cross-media summarization components, aiming at fusing the two analyses and creating a self-contained object. Building on the MEAD development platform [6], the TS component extracts the top-ranked sentences of a story: for each sentence, a salience score is computed as a weighted sum of summary-worthy features.
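The concrete feature set and weights used on top of MEAD are not reproduced here. The sketch below shows the general shape of such a scorer, assuming three common summary-worthy features (overlap with the story "centroid", sentence position, length) and made-up weights; all names are illustrative.

```python
from collections import Counter

def tokenize(sentence):
    # Crude whitespace tokenisation, stripping trailing punctuation.
    return [w.strip(".,").lower() for w in sentence.split()]

def rank_sentences(sentences, weights=(1.0, 0.8, 0.3)):
    """MEAD-style extractive scoring (a sketch, not MEAD's actual code):
    each sentence receives a weighted sum of summary-worthy features."""
    w_cent, w_pos, w_len = weights
    counts = Counter(w for s in sentences for w in set(tokenize(s)))
    centroid = {w for w, c in counts.items() if c > 1}  # words shared by sentences
    scored = []
    for i, s in enumerate(sentences):
        words = set(tokenize(s))
        f_cent = len(words & centroid) / len(words)
        f_pos = 1.0 / (i + 1)                  # lead sentences score higher
        f_len = min(len(words) / 20.0, 1.0)    # dampen very short sentences
        scored.append((w_cent * f_cent + w_pos * f_pos + w_len * f_len, i, s))
    return [s for _, _, s in sorted(scored, reverse=True)]

story = ["The EU summit opened in Brussels today.",
         "Leaders discussed the EU budget at the summit.",
         "Weather delayed some delegations.",
         "The budget talks continue tomorrow in Brussels."]
print(rank_sentences(story)[:2])  # the two top-ranked sentences form the summary
```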
The VS component comprises the scene segmentation and scene clustering & labelling modules [6]. The scene segmenter segments the video sequence into scenes. Scene boundaries are detected as local minima in the visual coherence function, with each scene corresponding, ideally, to a story of the video. Scene clustering caters for simple applications that need a few indicative images. Keyframes of the scene are clustered into larger parts, from which a prototypical image is chosen. Clustering is repeated iteratively to acquire a hierarchical cluster tree; the prototypes of these clusters can be seen as representative images of the scene. Going a step further, scene labelling is invoked for creating structured views of a file; currently, it is news programmes that can be browsed in such a way, allowing the user to watch all “anchor”, “interview” and “reportage” segments. The module labels all shots of a file accordingly, by exploiting a belief propagation network. Finally, the CSS brings all these pieces of information together, providing visualisation interfaces (SMIL/HTML+TIME) that enable the user to preview multimedia objects effectively before downloading them.
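The visual coherence function itself is defined in [6]. Assuming it is already available as one score per shot transition, a boundary detector over its local minima could look like the following sketch; the threshold and all names are illustrative, not the project's implementation.

```python
def scene_boundaries(coherence, min_depth=0.15):
    """Detect scene boundaries at local minima of a visual coherence curve
    (one value per shot transition); `min_depth` filters out shallow dips so
    that only pronounced drops in coherence start a new scene."""
    boundaries = []
    for i in range(1, len(coherence) - 1):
        is_minimum = coherence[i] < coherence[i - 1] and coherence[i] < coherence[i + 1]
        depth = min(coherence[i - 1], coherence[i + 1]) - coherence[i]
        if is_minimum and depth >= min_depth:
            boundaries.append(i)
    return boundaries

# Hypothetical coherence values sampled at shot transitions
curve = [0.9, 0.8, 0.3, 0.85, 0.9, 0.88, 0.4, 0.8]
print(scene_boundaries(curve))   # -> [2, 6]
```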
D. Cross-lingual Translation

The CLTS subsystem allows users to query documents written in different languages, to categorise content expressed in different languages and to preview language-specific summaries. A bilingual lexicon extraction module is used to generate lexical equivalences for query translation purposes, but also to replace keywords in a target language in case a document is linguistically not well formed (e.g. output from a speech recognizer) and thus not effectively translated. Last, a statistical machine translation module is responsible for providing translations of the textual part of the summaries produced by the Cross-Media Summarization Subsystem.
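The lexicon extraction and keyword-replacement strategies are not detailed in this overview. The sketch below only illustrates the query-translation use of such a lexicon, with a made-up English-to-Greek lexicon fragment (English and Greek being the project's languages), an arbitrary confidence threshold, and hypothetical names throughout.

```python
# Hypothetical fragment of an automatically extracted English->Greek lexicon:
# each source term maps to translation candidates with association scores.
LEXICON = {
    "election": [("εκλογές", 0.82), ("ψηφοφορία", 0.31)],
    "summit":   [("σύνοδος", 0.77)],
    "budget":   [("προϋπολογισμός", 0.90)],
}

def translate_query(query, lexicon=LEXICON, threshold=0.5):
    """Replace each query keyword with its best-scoring lexicon translation;
    keywords without a confident equivalent are kept as-is, so the query
    still matches untranslated (e.g. named-entity) index terms."""
    target_terms = []
    for term in query.lower().split():
        candidates = [t for t, score in lexicon.get(term, []) if score >= threshold]
        target_terms.append(candidates[0] if candidates else term)
    return " ".join(target_terms)

print(translate_query("EU summit budget"))  # -> "eu σύνοδος προϋπολογισμός"
```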
E. Usability Evaluation

Apart from the technical evaluation of the system components (cf. [2], [3], [5], [7]), the integrated REVEAL THIS prototype is currently being evaluated by prospective users following a task-based approach. A pool of about 30 users for each application (pull and push) has been created. Based on typical search sessions of these users, appropriate search tasks have been created for the users to undertake using the REVEAL THIS prototype. Feedback from the evaluation will guide the final system refinements. The technology developed is envisaged to contribute to a content management platform that can be used by content providers, to add value to their content, and directly by end users, for accessing multimedia information.

REFERENCES

[1] K. Pastra and S. Piperidis, “Video Search: New Challenges in the Pervasive Digital Video Era”, Journal of Virtual Reality and Broadcasting, in press.
[2] F. Perronnin, C. Dance, G. Csurka, M. Bressan, “Adapted Vocabularies for Generic Visual Categorization”, in Proceedings of the European Conference on Computer Vision (ECCV), Graz, Austria, 2006.
[3] M. De Smet, R. Fransens, L. Van Gool, “A generalised EM approach for 3D model based face recognition under occlusions”, in Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), New York, USA, 2006.
[4] M. Yakici and F. Crestani, “Cross-media Indexing in the Reveal This prototype”, in Proceedings of the LREC workshop on “Crossing media for improved information access”, Genoa, Italy, 2006.
[5] J. Renders, E. Gaussier, C. Goutte, F. Pacull, G. Csurka, “Categorization in multiple category systems”, in Proceedings of the 23rd International Conference on Machine Learning (ICML), Pittsburgh, USA, 2006.
[6] B. Georgantopoulos, T. Goedeme, S. Lounis, H. Papageorgiou, T. Tuytelaars, L. Van Gool, “Cross-media summarization in a retrieval setting”, in Proceedings of the LREC 2006 workshop on “Crossing media for improved information access”, Genoa, Italy, 2006.
[7] M. Simard, N. Cancedda, B. Cavestro, M. Dymetman, E. Gaussier, C. Goutte, K. Yamada, P. Langlais and A. Mauser, “Translating with Non-contiguous Phrases”, in Proceedings of HLT/EMNLP, Vancouver, Canada, 2005.