=Paper=
{{Paper
|id=Vol-1516/p6
|storemode=property
|title=A Graph-Based Approach and Analysis Framework for Hierarchical Browsing of Video Content
|pdfUrl=https://ceur-ws.org/Vol-1516/p6.pdf
|volume=Vol-1516
|dblpUrl=https://dblp.org/rec/conf/tvx/RickertEE15
}}
==A Graph-Based Approach and Analysis Framework for Hierarchical Browsing of Video Content==
A Graph-Based Approach and Analysis Framework for Hierarchical Content Browsing

Markus Rickert, Benedikt Etzold, Maximilian Eibl
Technische Universität Chemnitz
Straße der Nationen 62, D-09111 Chemnitz, Germany
markus.rickert@cs.tu-chemnitz.de, benedikt.etzold@cs.tu-chemnitz.de, maximilian.eibl@cs.tu-chemnitz.de
ABSTRACT
Systems for multimedia retrieval have been an object of scientific research for many years. When it comes to presenting results to the user, many solutions disregard the set of problems connected to content delivery. Especially the time-constrained results of video retrieval systems need a different visualization. In this paper we present our solution for hierarchical content browsing of video files. Our workflow covers the phases of ingest, transcoding, automatic analysis, intellectual annotation and data aggregation. We describe an algorithm for the graph-based analysis of the content structure in videos. Based on the identified requirements of professional users, we developed a user interface that enables access to retrieval results on different hierarchical abstraction levels.

Author Keywords
Content browsing, video analysis, video retrieval, graph-based analysis, visualization, algorithm, user interface

ACM Classification Keywords
E.1 [Data Structures]: Graphs and networks; H.5.1 [Information Interfaces and Presentation]: Multimedia; I.2.10 [Vision and Scene Understanding]: Video analysis

INTRODUCTION
Compared to other areas of information retrieval, the content browsing of audiovisual media bears special challenges. Videos are time-dependent. Usually the user's intention is to find an element inside a video depicting a certain semantic concept like a person, topic, location or event. By querying a video database, the returned result is either a complete video item or a single element inside a video item determined by its time position. Professional users are not mainly interested in finding only a single occurrence of the queried semantic concept. They want to gather the whole sequence related to their search query, e.g. to reuse it in a news report or for historic research. The user usually sees the retrieval result as a starting point for a further manual searching process inside the video item, operated by using the playback and seek functions of the player software.

In this paper we present our approach to a hierarchical presentation of video items that supports professional users while browsing and consuming the content of a media retrieval system. Because of its primary focus on video content from television programs, this solution works best on video material edited in a post-production workflow. It is not supposed to be used on e.g. surveillance videos. Our framework has been developed to provide automatic and intellectual annotation for historical television recorded on video tapes. The digitized master copies and their metadata can be searched and displayed in a web-based user interface (UI). Video shots and sequences can be explored as a hierarchical structure in the UI. The system is in use in a pilot project by the media state authority of Saxony (Sächsische Landesmedienanstalt) in Germany.

"3rd International Workshop on Interactive Content Consumption at TVX'15, June 3rd, 2015, Brussels, Belgium. Copyright is held by the author(s)/owner(s)."

USER REQUIREMENTS & EXISTING WORKFLOWS
Our use case focuses on user groups in professions that rely heavily on reviewing large amounts of video data on a daily basis, like journalists, editors or historians.

In a set of interviews we asked a group of experts to describe their daily work. Thereby, we especially focused on those areas that deal with the examination of the results of archive queries. Other fields of interest were the process of querying, preferred software solutions and the planning of new reports or videos. Our findings were subsequently merged into an extensive workflow that was used for identifying different problem areas.

Altogether, we spoke to three experts from three different German TV stations, who all work in the field of TV journalism. Their similar statements and their reports on the workflows of other professionals and institutions give reason to believe that our workflow is representative for a significant part of this field of work. Conducting surveys and interviews [17, 18], we identified some of the main problems they face as part of their working routine:

- Metadata is often either fragmentary or missing completely. While standards or recommendations exist in most professions, they are usually ignored due to bottlenecks in time and personnel.
- Video data is normally stored in its final state, e.g. a film that has already been edited in post-production. In the case of search queries returning more than one result, users often receive a single file containing a queue of all relevant video files.
- In TV production, time pressure is always high because of narrow schedules and the need for instant coverage of current events.

Specific software solutions addressing these issues do not yet exist in professional scenarios. This leads to a highly inefficient workflow: precision rates are usually low because of the described storing modalities and the lack of precise metadata. Therefore, numerous files of comparatively large size have to be inspected in a short period of time.

Classical User Interfaces
The software that is used is normally designed either for the simple consumption of video content (e.g. VLC Media Player or Apple QuickTime) or for the tasks of professional post-production (e.g. Avid MediaComposer or Adobe Premiere). Both approaches are based on a perspective that emphasizes the linear structure of the completed video during or after the process of editing. By showing an ordered sequence of single shots, they present the content in consideration of the editor's intention, but not of the needs of an expert using a retrieval system.

Requirements
Based on these findings, we compiled a list of requirements that have to be met by a user interface to improve the user experience significantly:

- Metadata is usable for both video processing and visualization.
- Information can be displayed based on the video's structure.
- Richness of detail can be increased for single segments of the video.
- The video itself can be accessed through any bit of information displayed in the UI.
- Relevant segments of the video can be used in later steps of the user's workflow, e.g. editing.

FRAMEWORK
Our framework provides functionalities for audio and video analysis, manual annotation, data warehousing, retrieval and visualization. It uses specialized components for each aspect. The core "dispatcher" controls the analysis process, the allocation of work units and the data aggregation. As deduced from [8], the requirements for a scalable analysis system based on heterogeneous scientific algorithms in the field of audio and video analysis are complex. The framework is presented here in its complete workflow for the first time; earlier publications covered only aspects of distinct components. A predecessor partial framework was presented in [3].

Our framework needs to support individual solutions, programmed in varying languages, based on different operating system environments and requesting various quantities of resources. Therefore it runs in an environment of virtual machines on a cluster of five Intel Xeon dual-quad-core host servers. The main components were written in C#/.NET and make use of a service-oriented architecture and web services. This provides a redundant and hardware-independent service, while supporting a variety of separate execution environments for each component. It also allows for a possible scale-out with additional hardware if needed.

The execution workflow for an individual video tape or file consists of five phases, as depicted in Figure 1. On the level of each stage it is intended to reach a maximum of concurrency.

Figure 1: Framework Workflow in its five Phases.

I. Digitization and Transcoding
The very first step is an incoming control of each video tape and the generation of a unique identifier. Our id system consists of a 12-byte block and can be represented and displayed for human reading as a combination of 12 hexadecimal digits in four segments plus a calculated check character (e.g. 0000-0074-0000-0026-Z). After the initial logging, the video tape is digitized with an automatic robot ingest system as described in [12], which runs batch jobs in parallel on up to six tape players.

The resulting digital master file is encoded with the broadband IMX50 video codec and captured in an MXF container for archiving and data exchange. As defined by [8], we create proxy versions of the archive file by transcoding it. For automatic annotation, analysis and as a preview video for the web UI, we use an H.264 codec at level 4.1 wrapped in an MP4 container.

II. Automatic Analysis and Annotation
The created analysis proxy video is transferred to the analysis cluster. The dispatcher schedules the analysis of each video file as a sequence of consecutive analysis steps. For performance reasons, each component can be instantiated multiple times. In the common configuration, the system runs with up to 12 individual virtual machines. The analysis components are controlled by the dispatcher via a web-service interface.
Shot detection component
The shot detection is the first component in the workflow. It provides a segmentation of the continuous video stream into parts of uninterrupted camera recordings (shots). The algorithms developed by [9] and [10] are based on calculating the cumulated error rate of individual motion vectors for each image block between two successive frames. The component's output is a list of metadata for every detected shot. Key frames of the shot are extracted for use in the UI and successive components.

Face detection component
The face detection component uses the key frames from the shot detection to mark bounding boxes around each detected face. The algorithm used is optimized for high precision and was developed by [10]. It is specialized on data corpora from local television broadcasts. Its result data is a set of metadata of the bounding boxes around detected faces and a sample image for each detected face.

Text extraction component
The text extraction component detects areas of overlay text boxes within the video stream. The algorithm by [11] uses a weighted discrete cosine transform (DCT) to detect macroblocks by finding regions located in the medium frequency spectrum. By normalizing the eigenvalues, a mask is calculated which is used to separate the textbox from the rest of the image. For the text-to-character transformation the software tesseract-ocr is used (https://code.google.com/p/tesseract-ocr/). The component creates key frame samples of the detected textboxes, metadata about the locations of the textboxes and the text extracted by OCR (optical character recognition).

Speech recognition component
The speech recognition component makes use of the speaker change recognition method described by [6] and extended by [3]. It provides data for the differentiation of individual voices and pauses. By applying Gaussian mixture models, individual speakers can be trained and recognized. The detected utterances of individual speakers are transferred to an automatic speech recognition (ASR) software. The resulting data provides not only the recognized words; it adds metadata about the time position and duration of the utterance and an id code for identification and re-recognition of the speaker.

III. Intellectual Annotation
This framework is not only used for demonstration of our solutions. It is in productive use for archiving historical tape-based material. This constitutes the need for additional intellectual annotation, since today's automatic annotation can provide support, but it cannot substitute the intellectual work of a human entirely. Secondly, the manually annotated metadata is used as training and test sets for the development of new algorithms. Therefore we collect metadata for each video tape in the form of classical intellectual annotation as it is already implemented in media archives.

Scene & Topic Annotation
We developed a web-based annotation tool for the intellectual annotation of the analyzed video files. To support the professional user, the tool makes use of the detected video shots. The video is presented in slices of camera shots. The video player repeats the current shot in a loop. This makes it easier for the user to fill out all input fields without constantly dealing with the player controls. When the user is finished with the shot, he can jump to the next. The user marks the boundaries of storyline sequences as collections of multiple shots and adds a variety of bibliographical metadata like title and type of video content, topics and subjects in terms of individuals, locations, institutions, things, creations (like art), and other metadata useful either for information retrieval or as development test data.

IV. Data Aggregation
In the past we analyzed video assets only for isolated scientific experiments. To process large quantities of videos now, the integration of results from different analysis algorithms becomes a key challenge. For our environment of use cases, a data-warehouse solution is needed to aggregate more than only the results of video analysis. On the one hand it needs to incorporate the metadata supplied from its sources, like production information and data from TV broadcasters. On the other hand it has to provide its data as an export artifact which is compatible with the formats and conventions used by archiving facilities and institutes.

A special challenge was to find a scheme which complies with the way video content producers and archives structure their data, and which includes technical data like feature vectors and audiovisual classifications. Our selected database scheme is adapted from a common standard for video documentation¹ developed for the German public television. We combined mandatory and optional metadata classes with the goal to maintain a maximum of compatibility.

¹ REM http://rmd.dra.de/remid.php?id=REM_APR_3

V. Content Delivery and Visualization
For data exchange and archiving, the digital master file, the proxy files and the metadata are exported to an LTO tape library. Search and content access for the user are provided by a webserver.

The user interface is used for web-based intellectual annotation, controlling the analysis process, information retrieval and content browsing. The UI is able to handle multiple tenants and has a scalable interface for different display resolutions and devices. Each function runs in its own web-app.
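The aggregation step of phase IV can be illustrated with a small sketch that merges the per-component results into one record per shot before export. The field names here are our assumptions for illustration only; the actual warehouse scheme follows the cited REM documentation standard.

```python
# Illustrative sketch of phase IV: results of the individual analysis
# components are merged into one record per shot. Field names are
# hypothetical; the real scheme is adapted from the REM standard.
def aggregate(shots, faces, texts, speech):
    """Each argument maps a shot id to that component's results."""
    records = []
    for shot_id, shot in shots.items():
        records.append({
            "shot": shot_id,
            "start": shot["start"],
            "duration": shot["duration"],
            "faces": faces.get(shot_id, []),
            "overlay_text": texts.get(shot_id, ""),
            "transcript": speech.get(shot_id, ""),
        })
    # A stable temporal order simplifies diffing of export artifacts.
    return sorted(records, key=lambda r: r["start"])

records = aggregate(
    {"s1": {"start": 0.0, "duration": 5.0},
     "s2": {"start": 5.0, "duration": 3.0}},
    faces={"s1": ["face_01.jpg"]},
    texts={"s2": "BREAKING NEWS"},
    speech={"s2": "good evening"},
)
```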
GRAPH-BASED VIDEO CLUSTERING
During the analysis and automatic annotation we extract segments of camera shots from the video stream. This shot segmentation is helpful for content browsing, but it suffers from over-segmentation: the structure is too detailed for the visualization of the actions inside a video. The user needs to be able to search for scenes or sequences as basic units.

Different approaches for clustering or grouping of related shots have been published. A detailed survey on the field of video segmentation is given by [13].

A common strategy in many clustering approaches is to find structures and similarities in the given video. The similarity measurement can be based on the classification of e.g. motion vectors, dominant color, edge histogram and editing tempo. By calculating the similarity of consecutive shots, groups can be identified. "Overlapping links", introduced by [16], was one of the early strategies to find structures inside videos. It was extended by [19, 20]. The algorithm can cluster similar shots, and the shots lying in between, as logical story units (LSUs) [21].

Our solution was inspired by overlapping links, the concept of the scene transition graph (STG) [14, 22] and the scene detection solution published by [15]. These approaches are still the subject of current publications and optimizations like [23, 24, 25]. Shots are represented as nodes, transitions as edges. Shots with a high similarity are clustered into group nodes. This process leads to a digraph with cycles.

Data Structure
Our proposed solution is derived from the concept of shot transition graphs. We use a weighted directed graph for the representation of hierarchical sequence structures in a video. Edges represent transitions between distinct shots or sequences. Nodes represent single shots or sub-segments with a new graph of shots inside. See Table 1.

Vs – Singular Node: a single continuous camera shot.
Es – Singular Edge: a directed edge between two Singular Nodes (Vs), representing the transition from a camera shot to its successor in the sequence of the video.
Va – Aggregated Node: a group of Singular or Aggregated Nodes forming a sequence or sub-graph.
Ea – Aggregated Edge: a directed edge between two Aggregated Nodes (Va) or between a Singular Node and an Aggregated Node. It represents a set of interrelated Singular Nodes, respectively a sub-graph containing a scene in the video.
C – Color-Similarity-Group: a list of shots grouped by their visual similarity. The similarity is measured by a combination of the MPEG-7 descriptors Edge Histogram (EHD) and Color Layout (CLD) [10, p. 169].
Sq – Sequence-List: a list of shots grouped by their affiliation to a sequence, found by intellectual annotation. A sequence represents a segment of continuous action or location in a video.

Table 1: Data structures.

Es – Duration of the transition; type of transition (cut, wipe, dissolve, fade) as described in the taxonomy by [7].
Vs – Number of the shot; times of start, duration and end of the shot; extracted keyframes of the first and last frame; extracted keyframes from face detection; data from text extraction.
Ea – Weight.
Va – A representative keyframe; start time of the earliest sub-element; end time of the latest sub-element; metadata of the speech recognition; annotation: topic, location, subjects, individuals etc.

Table 2: Metadata available in the data structures.

Sequence-Graph Algorithm
In order to access the video content in a graph-based hierarchical structure, we create a directed graph to represent the video's shots and sequences.

Procedure Sequence-Graph
input: list of detected shot boundaries and transitions Sh, list of sequences Sq
output: sequence graph G1(Ea, Va)
1.  for each detected shot and transition sh_i from Sh do
2.      add new vertex Vs_i to G1
3.      add new edge Es_i to G1 connecting Vs_i and Vs_i+1
4.  end for each
5.  for each sequence sq_j from Sq do
6.      create new Va_j in G1
7.  end for each
8.  for each Vs_i from G1 do
9.      if Vs_i belongs to sequence sq_j then
10.         remove Vs_i and its out-edges and in-edges from G1
11.         add Vs_i and its edges as sub-elements to Va_j
12.     end if
13. end for each
14. for each Es_i removed from G1 do
15.     add new edge Ea_i in G1 connecting Va_j with the predecessors resp. successors of Vs_i
16. end for each
17. for each Ea_k from G1 do
18.     if more than one Ea exists with the same source vertex and the same target vertex as Ea_k then
19.         remove all duplicates and increment the weight of Ea_k
20.     end if
21. end for each

Figure 2: Procedure to create a Sequence-Graph.
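The procedure in Figure 2 can be transcribed into a short, runnable sketch. The data model is deliberately reduced to plain shot ids and dictionaries, and all shot metadata from Table 2 is omitted; only the grouping and edge-merging logic is kept.

```python
# Runnable sketch of the Sequence-Graph procedure (Figure 2), with
# the data model reduced to plain shot ids and dictionaries.
def build_sequence_graph(shots, sequences):
    """shots: shot ids in playback order.
    sequences: dict mapping sequence id -> set of member shot ids.
    Returns (top, subgraphs): weighted first-level edges and the
    chained shot edges inside every aggregated node."""
    # Steps 1-4: one singular vertex per shot, edges in cut order.
    edges = [(shots[i], shots[i + 1]) for i in range(len(shots) - 1)]

    # Steps 5-13: map every shot to its aggregated node (or itself).
    owner = {s: s for s in shots}
    for seq_id, members in sequences.items():
        for s in members:
            owner[s] = seq_id

    subgraphs = {seq_id: [] for seq_id in sequences}
    top = {}
    for src, dst in edges:
        a, b = owner[src], owner[dst]
        if a == b and a in subgraphs:
            subgraphs[a].append((src, dst))  # edge moves into the node
        else:
            # Steps 14-21: rewire to aggregated nodes and merge
            # duplicate edges by incrementing the weight.
            top[(a, b)] = top.get((a, b), 0) + 1
    return top, subgraphs

# Five shots, of which s2-s4 were annotated as one sequence "A".
top, sub = build_sequence_graph(
    ["s1", "s2", "s3", "s4", "s5"], {"A": {"s2", "s3", "s4"}})
```

On the example, the first level contains the edges (s1, A) and (A, s5), while the chain s2 to s4 moves into the aggregated node A.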
The vertices belonging to a sequence are aggregated to build a second level in the hierarchy. Metadata created during the intellectual annotation drives the aggregation. The resulting Sequence-Graph (algorithm in Figure 2) represents all content sequences as aggregated nodes, and the remaining singular nodes not belonging to a sequence, on the first level. Inside each aggregated node, a sub-graph is created on the second level, representing the chain of shots forming a sequence.

Figure 3: Visualization of vertices and edges of the Sequence-Graph.

Similarity-Graph Algorithm
One important feature of videos from film and television is the presence of recurring images. This happens especially when interviews or dialogs are recorded, where the same individuals are shown several times. In terms of film grammar this is called the shot/reverse-shot method. See Figure 4.

Procedure Similarity-Graph
input: list of Color-Similarity-Groups C, graph G1
output: sequence graph with similarity sub-graphs G2
1.  for each Va_i from G1 do
2.      create new temporary graph Gt_i
3.      for each similarity group c_j from C do
4.          if one or more sub-vertices Vs of Va_i are in c_j then
5.              add new group vertex Va_j to Gt_i
6.          end if
7.      end for each
8.      for each sub-vertex Vs of Va_i not in any c_j do
9.          add new non-group vertex Vs_j to Gt_i
10.     end for each
11.     for each sub-edge Es ∪ Ea from Va_i do
12.         add new edge Ea_j to Gt_i connecting the corresponding source and target group vertices Va, respectively the non-group vertex Vs if source or target is not part of a similarity group
13.     end for each
14.     calculate the strongly connected components of Gt_i
15.     for each strongly connected component scc_k from Gt_i do
16.         create new similarity vertex Va_k as sub-element in Va_i
17.         for each shot vertex Vs_l from Va_i do
18.             if Vs_l is a member of scc_k then
19.                 remove Vs_l and its out-edges and in-edges from Va_i
20.                 add Vs_l and its edges as sub-elements to Va_k
21.             end if
22.         end for each
23.         for each Es_m removed from Va_i do
24.             add new edge Ea_m in Va_i connecting Va_k with its predecessors resp. successors
25.         end for each
26.     end for each
27.     for each edge Ea_m from Va_i do
28.         if more than one Ea exists with the same source vertex and the same target vertex as Ea_m then
29.             remove all duplicates and increment the weight of Ea_m
30.         end if
31.     end for each
32. end for each

Figure 4: Similarity-Graph Procedure.

Resulting Graph Structure
The final resulting graph represents the video in a hierarchical structure. On the first level, all sequences and all standalone shots can be accessed. By selecting a sequence, all shots and similarity groups inside the selected sequence can be accessed. If a shot shows a similar image multiple times, each instance of this image is aggregated to a group. Recurring shots are recognizable by cyclic structures of the edges. On selecting a similarity group, the individual instances of the similar shots can be accessed. The results of the two clustering steps and the final 3-layer graph are shown in Figure 5. Figure 6 shows the visualization of a single layer as used in the UI.

Figure 5: G1 (Sequence-Graph), G2 (Similarity-Graph and Sequence-Graph).
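The core of the Figure 4 procedure can be transcribed for a single aggregated node: shots are contracted by their color-similarity group, and the strongly connected components (SCCs) of the contracted digraph become the similarity sub-nodes that capture shot/reverse-shot cycles. For brevity this sketch finds SCCs with a plain mutual-reachability check instead of an optimized algorithm.

```python
# Simplified, runnable transcription of the Similarity-Graph idea
# (Figure 4) for one aggregated node. SCC computation is done by a
# simple mutual-reachability check, an assumption made for brevity.
def similarity_subnodes(edges, groups):
    """edges: directed shot-to-shot edges inside one sequence.
    groups: dict mapping group id -> set of visually similar shot ids.
    Returns the SCCs of the contracted graph (singletons included)."""
    owner = {}                        # shot id -> similarity group id
    for gid, members in groups.items():
        for s in members:
            owner[s] = gid

    # Steps 3-13: contract similar shots into group vertices.
    verts, cedges = set(), set()
    for src, dst in edges:
        a, b = owner.get(src, src), owner.get(dst, dst)
        verts.update((a, b))
        if a != b:
            cedges.add((a, b))

    out = {v: [] for v in verts}
    for a, b in cedges:
        out[a].append(b)

    def reachable(start):
        seen, stack = {start}, [start]
        while stack:
            for nxt in out[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Steps 14-15: v and w share an SCC iff each reaches the other.
    reach = {v: reachable(v) for v in verts}
    sccs, assigned = [], set()
    for v in sorted(verts):
        if v in assigned:
            continue
        comp = {w for w in reach[v] if v in reach[w]}
        sccs.append(comp)
        assigned |= comp
    return sccs

# A dialog cut as shot/reverse-shot: s1/s3 show one speaker, s2/s4
# the other, so the two contracted group vertices form one cycle.
sccs = similarity_subnodes(
    [("s1", "s2"), ("s2", "s3"), ("s3", "s4")],
    {"g1": {"s1", "s3"}, "g2": {"s2", "s4"}})
```

The cycle between the two group vertices collapses into a single strongly connected component, which is exactly the recurring-shot structure the paper's similarity sub-nodes represent.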
GRAPH-BASED USER INTERFACE
UI approaches with the purpose of addressing structures in video content have been developed mainly in the fields of film studies and human-computer interaction (HCI). They normally focus on certain key aspects like the analysis, description [4] or summarization [1] of content. From their perspective, the temporal order of a video's single sequences is an important bit of information and therefore one of the fundamental principles of their modus operandi.

By shifting the main focus to the video's structure, we managed to design a user interface that makes it possible to quickly overlook a whole file without losing any detail.

Graph-based User Interface
In order to avoid the issues reported by our user groups, as described above, we decided to organize all available information in a way that emphasizes the video's structure. Richness of detail is increased from top (overview) to bottom (all details and metadata). The presented metadata types are summarized in Table 2. The following interface description is connected to the layers presented in Figures 6 and 7.

I. Video player – The player can be used to examine the single segments in any intended way. In order to provide permanent availability, it remains at the top of the screen when scrolling to the lower parts of the UI.

II. Current graph – Its nodes represent either a single shot group or a cluster of related groups. By using a simple directed graph for the top level, we were able to display all nodes in a familiar left-to-right order. Every node contains a representative image sample and some basic information on its content. The existence of child graphs is color-coded (blue) on this level of detail.

III. Collapsible container – Used to display a more granular child graph belonging to a certain top-level node.

IV. Queue – Nodes can be transferred into a drag-and-drop operated queue of cards that offer a more detailed view of their content. Furthermore, they can be used to manage a collection of shots or shot groups that can be watched directly or exported for further use, e.g. in editing software.

V. Details view – Shows all data that is available for one of the cards. It consists of several lines displaying key frames, detected faces, off-text and text overlays.

Figure 6: Multilayer-View of a Graph-based UI.
Figure 7: Schematic View of the UI.

EVALUATION
We performed a first evaluation of our approach by using a combination of baseline tests and questionnaires. Therefore, we designed a set of tasks comparable to those described by our group of experts. A screenshot of the graph-based UI is depicted in Figure 7. The content used for evaluation consists of real television news programs produced during the early to mid-1990s. It was archived on VHS video tapes. The actual test set was composed by randomly selecting 1377 minutes of this video material.

Four expert users were asked to perform searching tasks. They were given short descriptions of 27 randomly picked video sequences with durations between 5 seconds and 10 minutes. The task was to find the described sequences in the corresponding video file and to write down the time codes of the sequence boundaries. Searching tasks like these are quite comparable to the real-life work of video editors, because video content in tape-based archives is only marginally documented. Manual content browsing in a video player and non-linear editing software (NLE) is used to find sequences of video content reusable in new video clips.
Tool                Average searching time (seconds)
Graph-based UI      93
VLC Media Player    122
Premiere Pro CS6    179

Figure 8: Evaluation results.

For comparison, the searching tasks were performed using our graph-based user interface, VLC Media Player and Adobe Premiere Pro (CS6). For each task the time needed for completion was recorded. Overall, 108 different search operations were performed. Furthermore, differences in the accuracy of the time codes were taken into account. With the graph-based UI, the average duration per searching task was 93 seconds. When searching with VLC (average: 122 s) and Premiere Pro (average: 179 s), significantly more time was needed (Figure 8). As a result, our graph-based solution outperformed VLC and Premiere Pro: in VLC 27.8% more time was needed, and searching in Premiere Pro needed 48.5% more time. One reason for the weak performance of Premiere Pro could be the zoom function: it was heavily used by the testers, but led to longer searching times.

One disadvantage of the graph-based UI turned out to be the fact that entities and events inside a video shot cannot be isolated. They are bound to the boundaries of the surrounding shot and cannot be exported independently. In terms of perceiving the actual structure of the video, all users reported gaining a deeper understanding when using our approach than when using VLC or Premiere Pro.

FUTURE WORK
The next step for the analysis and graph-based clustering will be the substitution of the manual annotation of video sequences by an automatic sequence segmentation algorithm. Surveys on the state of the art in video segmentation indicate that a multimodal fusion of the analysis results can be used to cluster successive shots into video sequences. Most approaches use visual similarity features, but as discussed in [7], concepts and rules from the production of video content can also be useful to find sequences or scenes inside video content.

The graph-based user interface will be evaluated in additional user tests, exploring whether its use is beneficial for non-professional users as well. A second study will evaluate which text-based metadata should be presented at the different elements to comply with the needs of the users. Currently, extensions of the UI are under development to enable a sync function. It will allow adapting the presented graph elements when the current position in the video shifts to the next sequence. This will give the UI a two-sided interaction between the video player and the graph structure.

CONCLUSION
In this paper we presented our concept of a hierarchical presentation of video items in a graph-based structure. We described our framework, which incorporates video and audio analysis, intellectual annotation and graph analysis to construct a multi-layer structure for content consumption. Our web-based UI shows how classical sequential content browsing in videos can be extended to incorporate the inner structures and relations of the video's sub-elements.

ACKNOWLEDGMENTS
Parts of this work were accomplished in the research project validAX, funded by the German Federal Ministry of Education and Research.

REFERENCES
[1] Adami, N., Benini, S., Leonardi, R. An overview of video shot clustering and summarization techniques for mobile applications. In Proc. MobiMedia '06, ACM (2006), No. 27.
[3] Knauf, R., Kürsten, J., Kurze, A., Ritter, M., Berger, A., Heinich, S., Eibl, M. Produce. annotate. archive. repurpose: accelerating the composition and metadata accumulation of TV content. In Proc. AIEMPro '11, ACM (2011), 30-36.
[4] Korte, H. Einführung in die systematische Filmanalyse. Schmidt (1999), 40.
[6] Lu, L. and Zhang, H.-J. Speaker change detection and tracking in real-time news broadcasting analysis. In Proc. MULTIMEDIA '02, ACM (2002), 602-610.
[7] Rickert, M. and Eibl, M. A proposal for a taxonomy of semantic editing devices to support semantic classification. In Proc. RACS 2014, ACM (2014), 34-39.
[8] Rickert, M. and Eibl, M. Evaluation of media analysis and information retrieval solutions for audio-visual content through their integration in realistic workflows of the broadcast industry. In Proc. RACS 2013, ACM (2013), 118-121.
[9] Ritter, M. and Eibl, M. An Extensible Tool for the Annotation of Videos Using Segmentation and Tracking. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2011), 295-304.
[10] Ritter, M. Optimierung von Algorithmen zur Videoanalyse. Chemnitz (2013), ISBN 978-3-944640-09-9, 119-144, 187-213.
[11] Heinich, S. Textdetektion und -extraktion mit gewichteter DCT und mehrwertiger Bildzerlegung. In Proc. WAM 2009, TU Chemnitz (2009), 151-162, ISBN 978-3-000278-58-7.
[12] Manthey, R., Herms, R., Ritter, M., Storz, M., Eibl, M. A Support Framework for Automated Video and Multimedia Workflows for Production and Archive. In Proc. HCI International 2013, Springer (2013), 336-341.
[13] Del Fabro, M., Böszörmenyi, L. State-of-the-art and future challenges in video scene detection: a survey. Multimedia Systems 19, 5, Springer (2013), 427-454.
[14] Yeung, M.M., Yeo, B.-L. Time-constrained clustering for segmentation of video into story units. In Proc. 13th International Conference on Pattern Recognition, IEEE (1996), vol. 3, 375-380. doi: 10.1109/ICPR.1996.546973.
[15] Ngo, C.-W., Ma, Y.-F., Zhang, H.-J. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2, IEEE (2005), 296-305.
[16] Hanjalic, A., Lagendijk, R.L., Biemond, J. Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9, 4, IEEE (1999), 580-588.
[17] Berger, A., Kürsten, J., Eibl, M. Visual String of Reformulation. In Proc. HCI International, Springer (2009), LNCS 5618.
[18] Berger, A. Design Thinking for Search User Interface Design. In Proc. EuroHCIR 2011, Newcastle (2011), 38-41.
[19] Kwon, Y.-M., Song, C.-J., Kim, I.-J. A new approach for high level video structuring. In Proc. Multimedia and Expo, ICME 2000, 773-776.
[20] Wang, W. and Gao, W. Automatic segmentation of news items based on video and audio features. In Proc. Advances in Multimedia Information Processing, PCM 2001, LNCS 2195, 498-505.
[21] Vendrig, J. and Worring, M. Systematic evaluation of logical story unit segmentation. IEEE Transactions on Multimedia 4, 4, IEEE (2002), 492-499.
[22] Shi, J. and Malik, J. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8, IEEE (2000), 888-905.
[23] Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H., Trancoso, I. Multi-modal scene segmentation using scene transition graphs. In Proc. MM '09, 17th ACM International Conference on Multimedia, ACM (2009), 665-668.
[24] Xu, S., Feng, B., Ding, P., Xu, B. Graph-based multi-modal scene detection for movie and teleplay. In Proc. ICASSP 2012, IEEE (2012), 1413-1416.
[25] Porteous, J., Benini, S., Canini, L., Charles, F., Cavazza, M., Leonardi, R. Interactive storytelling via video content recombination. In Proc. MM '10, ACM International Conference on Multimedia, ACM (2010), 1715-1718.