<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Graph-Based Approach and Analysis Framework for Hierarchical Content Browsing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Markus Rickert</string-name>
<email>markus.rickert@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benedikt Etzold</string-name>
<email>benedikt.etzold@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maximilian Eibl</string-name>
<email>maximilian.eibl@cs.tu-chemnitz.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Chemnitz</institution>
          ,
          <addr-line>Straße der Nationen 62, D-09111 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Technische Universität Chemnitz</institution>
          ,
          <addr-line>Straße der Nationen 62, D-09111 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Technische Universität Chemnitz</institution>
          ,
          <addr-line>Straße der Nationen 62, D-09111 Chemnitz</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
<p>Systems for multimedia retrieval have been an object of scientific research for many years. When it comes to presenting results to the user, however, many solutions disregard the set of problems connected to content delivery. Especially the time-constrained results of video retrieval systems need a different visualization. In this paper we present our solution for hierarchical content browsing of video files. Our workflow covers the phases of ingest, transcoding, automatic analysis, intellectual annotation and data aggregation. We describe an algorithm for the graph-based analysis of the content structure of videos. Based on the requirements of professional users, we developed a user interface that enables access to retrieval results on different hierarchical abstraction levels.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Compared to other areas of information retrieval,
content browsing of audiovisual media poses special challenges.
Videos are time-dependent. Usually, the user's intention is to
find an element inside a video depicting a certain semantic
concept such as a person, topic, location or event. When querying a
video database, the returned result is either a complete video
item or a single element inside a video item determined by
its time position. Professional users are not mainly interested
in finding only a single occurrence of the queried semantic
concept. They want to gather the whole sequence related to
their search query, e.g. to reuse it in a news report or for
historical research. The user usually treats the retrieval result as a
starting point for a further manual search inside the video item,
operated with the playback and seek functions of the player
software.</p>
      <p>In this paper we present our approach to a
hierarchical presentation of video items that supports professional
users while browsing and consuming the content of a media
retrieval system. Since the primary focus is video content
from television programs, this solution works best on video
material edited in a post-production workflow. It is not
intended for use on, e.g., surveillance videos. Our framework
has been developed to provide automatic and intellectual
annotation of historical television programs recorded on video tapes. The
digitized master copies and their metadata can be searched
and displayed in a web-based user interface (UI). Video shots
and sequences can be explored as a hierarchical structure in
the UI. The system is in use in a pilot project by the media
state authority of Saxony (Sächsische Landesmedienanstalt)
in Germany.</p>
      <p>USER REQUIREMENTS &amp; EXISTING WORKFLOWS
Our use case focuses on user groups in professions that rely
heavily on reviewing large amounts of video data on a daily
basis, such as journalists, editors and historians.</p>
      <p>In a set of interviews, we asked a group of experts to describe
their daily work. We especially focused on those
areas that deal with the examination of the results of archive
queries. Other fields of interest were the process of querying,
preferred software solutions and the planning of new reports
or videos. Our findings were subsequently merged into an
extensive workflow that was used for identifying different
problem areas.</p>
      <p>
        Altogether, we spoke to three experts from three different
German TV stations, who all work in the field of TV
journalism. Their similar statements and their reports on the
workflows of other professionals and institutions give reason to
believe that our workflow is representative of a significant
part of this field of work. Conducting surveys and interviews
[
        <xref ref-type="bibr" rid="ref15 ref16">17, 18</xref>
        ], we identified some of the main problems they face
as part of their working routine:
• Metadata is often either fragmentary or missing
completely. While standards or recommendations exist in
most professions, they are usually ignored due to
bottlenecks in time and personnel.
• Video data is normally stored in its final state, e.g. a film
that has already been edited in post-production. In the
case of search queries returning more than one result,
users often receive a single file containing a queue of all
relevant video files.
• In TV production, time pressure is always high because
of narrow schedules and the need for instant coverage of
current events.
      </p>
      <p>Specific software solutions addressing these issues do not yet
exist in professional scenarios. This leads to a highly
inefficient workflow: precision rates are usually low because of
the described storage modalities and the lack of precise
metadata. Therefore, numerous files of comparatively large
size have to be inspected in a short period of time.
Classical User Interfaces
The software in use is normally designed either for the
simple consumption of video content (e.g. VLC Media
Player or Apple QuickTime) or for the tasks of professional
post-production (e.g. Avid Media Composer or Adobe
Premiere). Both approaches are based on a perspective that
emphasizes the linear structure of the completed video during or
after the process of editing. By showing an ordered sequence
of single shots, they present the content in consideration of
the editor's intention, but not of the needs of an expert using
a retrieval system.</p>
      <p>Requirements
Based on these findings, we compiled a list of requirements
that have to be met by a user interface to improve the user
experience significantly:
• Metadata is usable for both video processing and
visualization.
• Information can be displayed based on the video's
structure.
• Richness of detail can be increased for single segments of
the video.
• The video itself can be accessed through any bit of
information displayed in the UI.
• Relevant segments of the video can be used in later steps
of the user's workflow, e.g. editing.</p>
      <p>
        FRAMEWORK
Our framework provides functionalities for audio and video
analysis, manual annotation, data warehousing, retrieval and
visualization, using specialized components for each aspect.
The core "dispatcher" controls the analysis process, the
allocation of work units and data aggregation. As deduced
from [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ], the requirements for a scalable analysis system
based on heterogeneous scientific algorithms in the field of
audio and video analysis are complex. The framework is
presented here in its complete workflow for the first time;
earlier publications covered only aspects of distinct
components. A predecessor partial framework was presented in [3].
Our framework needs to support individual solutions,
programmed in varying languages, based on different operating
system environments and requesting various quantities of
resources. Therefore, it runs in an environment of virtual
machines on a cluster of five Intel Xeon dual quad-core host
servers. The main components were written in C# .NET
and make use of a service-oriented architecture and web
services. This provides a redundant and hardware-independent
service, while supporting a variety of separate
execution environments for each component. It also allows
for a possible scale-out with additional hardware if needed.
The execution workflow for an individual video tape or file
consists of five phases, as depicted in Figure 1. Within each
stage, the goal is to reach a maximum of concurrency.
      </p>
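      <p>To illustrate this scheduling behaviour, consider the following minimal Python sketch. It is our own simplification: the component names and the run_component stand-in are hypothetical, and the actual dispatcher is a C# .NET web-service system. The sketch runs the analysis steps of each video strictly in order while processing several videos concurrently:</p>
      <preformat>
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the framework's analysis components.
PIPELINE = ["shot_detection", "face_detection",
            "text_extraction", "speech_recognition"]

def run_component(component, video):
    # Placeholder for the web-service call that triggers one analysis step.
    print(f"{component}: {video}")

def analyze(video):
    # The steps of one video run consecutively, since later components
    # reuse the results of earlier ones (e.g. key frames from shot detection).
    for component in PIPELINE:
        run_component(component, video)

videos = ["tape-0001.mp4", "tape-0002.mp4", "tape-0003.mp4"]

# Concurrency is achieved across videos, mirroring the up to 12
# parallel analysis instances mentioned in the text.
with ThreadPoolExecutor(max_workers=12) as pool:
    list(pool.map(analyze, videos))
      </preformat>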
      <p>
        I. Digitization and Transcoding
The very first step is an incoming inspection of each video tape
and the generation of a unique identifier. Our ID system
consists of a 12-byte block and can be represented and displayed
for human reading as hexadecimal digits in four segments plus a
calculated check character (e.g. 0000-0074-0000-0026-Z).
After the initial logging, the video
tape is digitized with an automatic robot ingest system as
described in [
        <xref ref-type="bibr" rid="ref10">12</xref>
        ]. It runs batch jobs in parallel on up to
six tape players.
      </p>
      <p>
        The resulting digital master file is encoded with the broadcast
IMX50 video codec and wrapped in an MXF container for
archiving and data exchange. As defined by [
        <xref ref-type="bibr" rid="ref6">8</xref>
        ], we create proxy
versions of the archive file by transcoding it. For automatic
annotation, analysis and as a preview video for the web UI,
we use the H.264 codec at level 4.1 wrapped in an
MP4 container.
      </p>
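      <p>A proxy transcode of this kind could be scripted as follows. This is a sketch only: the paper does not name its transcoding tool, so the common ffmpeg command line is assumed here, and the file names are placeholders:</p>
      <preformat>
import subprocess

# Create an H.264 level 4.1 proxy in an MP4 container from the MXF master.
# "master.mxf" and "proxy.mp4" are placeholder names.
subprocess.run([
    "ffmpeg",
    "-i", "master.mxf",   # IMX50 digital master in its MXF container
    "-c:v", "libx264",    # H.264 video
    "-profile:v", "high",
    "-level:v", "4.1",    # level 4.1, as used for the analysis/web proxy
    "-c:a", "aac",        # compressed audio track for the proxy
    "proxy.mp4",
], check=True)
      </preformat>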
      <p>
        II. Automatic Analysis and Annotation
The created analysis proxy video is transferred to the analysis
cluster. The dispatcher schedules the analysis of each video
file as a sequence of consecutive analysis steps. For
performance reasons, each component can be instantiated multiple
times; in the common configuration, the system runs with up
to 12 individual virtual machines. The analysis components
are controlled by the dispatcher via a web-service interface.
Shot detection component
The shot detection is the first component in the workflow. It
provides a segmentation of the continuous video stream into
parts of uninterrupted camera recordings (shots). The
algorithms developed by [
        <xref ref-type="bibr" rid="ref7">9</xref>
        ] and [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ] are based on calculating the
cumulated error rate of individual motion vectors for each
image block between two successive frames. The
component's output is a list of metadata for every detected shot. Key
frames of each shot are extracted for use in the UI and in
successive components.
      </p>
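      <p>The following Python sketch conveys the idea on a strongly simplified level. It is not the algorithm of [9] and [10]: instead of real motion-vector errors it uses the mean absolute block difference between successive frames as the error measure, and the threshold is illustrative:</p>
      <preformat>
import cv2
import numpy as np

def shot_boundaries(path, block=16, threshold=30.0):
    # Simplified stand-in: the mean absolute block difference between
    # successive frames serves as the error measure; a hard cut is
    # assumed when the cumulated error spikes above the threshold.
    cap = cv2.VideoCapture(path)
    boundaries, prev, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diff = np.abs(gray - prev)
            h, w = diff.shape
            # Crop to a multiple of the block size, then average per block.
            diff = diff[:h - h % block, :w - w % block]
            blocks = diff.reshape(diff.shape[0] // block, block,
                                  diff.shape[1] // block, block)
            err = blocks.mean(axis=(1, 3)).mean()
            if err > threshold:
                boundaries.append(index)
        prev, index = gray, index + 1
    cap.release()
    return boundaries
      </preformat>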
      <p>
        Face detection component
The face detection component uses the key frames from the
shot detection and marks bounding boxes around each detected
face. The algorithm, developed by [
        <xref ref-type="bibr" rid="ref8">10</xref>
        ], is optimized for high precision and
specialized for data corpora from local television broadcasts.
Its result data is a set of metadata describing the bounding
box around each detected face and a sample image for each
detected face.
      </p>
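      <p>The component's input/output contract can be illustrated with a short sketch. The stock OpenCV Haar cascade used here is only a stand-in for the specialized detector of [10]:</p>
      <preformat>
import cv2

# Stock OpenCV Haar cascade as a stand-in for the specialized detector.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(key_frame_path):
    image = cv2.imread(key_frame_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    boxes = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    results = []
    for i, (x, y, w, h) in enumerate(boxes):
        cv2.imwrite(f"face_{i}.png", image[y:y + h, x:x + w])  # face sample
        results.append({"x": int(x), "y": int(y), "w": int(w), "h": int(h)})
    return results  # bounding-box metadata per detected face
      </preformat>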
      <p>
        Text extraction component
The text extraction component detects areas of overlaid text
boxes within the video stream. The algorithm by [
        <xref ref-type="bibr" rid="ref9">11</xref>
        ] uses a
weighted discrete cosine transform (DCT) to detect
macroblocks by finding regions located in the medium frequency
spectrum. By normalizing the eigenvalues, a mask is
calculated which is used to separate the text box from the rest of
the image. For the transformation of the text regions into
characters, the software tesseract-ocr is used
(https://code.google.com/p/tesseract-ocr/). The component creates key frame samples of the
detected text boxes, metadata about the locations of the
text boxes and the extracted text from the OCR (optical character
recognition).
      </p>
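      <p>A strongly simplified sketch of the weighted-DCT idea is given below; the frequency band and the threshold are illustrative assumptions, not the parameters of [11]:</p>
      <preformat>
import cv2
import numpy as np
import pytesseract

def overlay_text(frame_bgr, block=8):
    # Mark 8x8 blocks with high mid-frequency DCT energy (typical for
    # sharp overlay text), then OCR the masked region with tesseract.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    h, w = gray.shape
    mask = np.zeros((h, w), np.uint8)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            coeffs = cv2.dct(gray[y:y + block, x:x + block])
            mid = np.abs(coeffs[1:5, 1:5]).sum()  # mid-frequency band
            if mid > 400.0:  # illustrative threshold
                mask[y:y + block, x:x + block] = 255
    boxed = cv2.bitwise_and(gray.astype(np.uint8), mask)
    return pytesseract.image_to_string(boxed)
      </preformat>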
      <p>
        Speech Recognition
The speech recognition component makes use of the
speaker change detection method described by [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ] and
extended by [3]. It provides data for the differentiation of
individual voices and pauses. By applying Gaussian mixture
models, individual speakers can be trained and recognized.
The detected utterances of individual speakers are transferred
to automatic speech recognition (ASR) software. The
resulting data provides not only the recognized words; it adds
metadata about the time position and duration of each
utterance and an id code for the identification and re-recognition of
the speaker.
      </p>
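      <p>The GMM-based speaker modelling can be sketched as follows; the feature choice (13 MFCCs) and the mixture size are common defaults assumed here, not values taken from the paper:</p>
      <preformat>
import librosa
from sklearn.mixture import GaussianMixture

def train_speaker_model(wav_path, n_components=16):
    # Fit one Gaussian mixture to the MFCC features of a known speaker.
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T  # frames x 13
    return GaussianMixture(n_components=n_components).fit(mfcc)

def identify(models, wav_path):
    # Score an unknown utterance against all trained speaker models
    # and return the id of the best-matching speaker.
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T
    scores = {sid: gmm.score(mfcc) for sid, gmm in models.items()}
    return max(scores, key=scores.get)
      </preformat>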
      <p>III. Intellectual Annotation
This framework is not only used for the demonstration of our
solutions; it is in productive use for archiving historical
tape-based material. This creates the need for additional
intellectual annotation, since today's automatic annotation can
provide support, but it cannot entirely substitute the intellectual
work of a human. Secondly, the manually annotated
metadata is used as training sets and test sets for the
development of new algorithms. Therefore, we collect metadata for
each video tape in the form of classical intellectual annotation as
it is already implemented in media archives.</p>
      <p>Scene &amp; Topic Annotation
We developed a web-based annotation tool for the
intellectual annotation of the analyzed video files. To support the
professional user, the tool makes use of the detected video
shots: the video is presented in slices of camera shots, and the
video player repeats the current shot in a loop. This makes it
easier for the user to fill out all input fields without
constantly dealing with the player controls. When finished
with a shot, the user can jump to the next one. The user marks
the boundaries of storyline sequences as collections of
multiple shots and adds a variety of bibliographic metadata such as
title and type of video content, topics and subjects in terms
of individuals, locations, institutions, things, creations (like
art) and other metadata useful either for information retrieval
or as development test data.</p>
      <p>IV. Data Aggregation
In the past, we analyzed video assets only for isolated
scientific experiments. To process large quantities of videos
now, the integration of results from different analysis
algorithms becomes a key challenge. For our environment of use
cases, a data-warehouse solution is needed to aggregate more
than only the results of video analysis. On the one hand, it
needs to incorporate the metadata supplied by its sources,
like production information and data from TV broadcasters.
On the other hand, it has to provide its data as an export
artifact which is compatible with the formats and conventions
used by archiving facilities and institutes.</p>
      <p>A special challenge was to find a scheme which complies
with the way video content producers and archives structure
their data and which includes technical data like feature vectors
and audiovisual classifications. Our selected database
scheme is adapted from a common standard for video
documentation (REM, http://rmd.dra.de/remid.php?id=REM_APR_3)
developed for the German public television. We
combined mandatory and optional metadata classes with the
goal of maintaining a maximum of compatibility.</p>
      <p>V. Content Delivery and Visualization
For data exchange and archiving, the digital master file, the
proxy files and the metadata are exported to an LTO tape
library. Search and content access for the user are provided by
a web server.</p>
      <p>The user interface is used for web-based intellectual
annotation, controlling the analysis process, information retrieval
and content browsing. The UI is able to handle multiple
tenants and has a scalable interface for different display
resolutions and devices. Each function runs in its own web app.
GRAPH-BASED VIDEO CLUSTERING
During the analysis and automatic annotation, we extract
segments of camera shots from the video stream. This
shot segmentation is helpful for content browsing, but it suffers from
over-segmentation: the structure is too detailed for the
visualization of the actions inside a video. The user needs to be
able to search for scenes or sequences as basic units.
Procedure Sequence-Graph
input: list of detected shot boundaries and transitions SH, list of sequences S
output: sequence-graph G1(V, E)
1.  for each detected shot and transition sh from SH do
2.      add new vertex v to G1
3.      add new edge e to G1 connecting v and v+1
4.  end for each
5.  for each sequence s from S do
6.      create new aggregated vertex va in G1
7.  end for each
8.  for each vertex v from G1 do
9.      if v belongs to sequence s then
10.         remove v and its out-edges and in-edges from G1
11.         add v and its edges as sub-elements to va
12.     end if
13. end for each
14. for each v removed from G1 do
15.     add new edge in G1 connecting va with the
        predecessors resp. successors of v
16. end for each
17. remove all duplicate edges and increment the weight of the remaining edge e</p>
      <p>Related graph-based approaches represent video structure in a
similar way [24, 25]: shots are represented as nodes, transitions
as edges. Shots with a high similarity are clustered into
group-nodes. This process leads to a digraph with cycles.</p>
      <p>Description of the graph elements:
Singular Edge – Directed edge between two
Singular Nodes (Ns) representing the transition from a
camera shot to its successor in the sequence of the video.
Singular Node – A single continuous camera shot.
Aggregated Edge – Directed edge between two
Aggregated Nodes (Na) or between a Singular Node
and an Aggregated Node. It represents a set of
interrelated Singular Nodes, respectively a sub-graph
containing a scene in the video.
Aggregated Node – A group of Singular or
Aggregated Nodes forming a sequence or sub-graph.
Color-Similarity-Group (CSG) – A list of shots,
grouped by their visual similarity. The similarity is
measured by a combination of the MPEG-7 descriptors
Edge Histogram (EHD) and Color Layout (CLD) [10, pp. 169].
Sequence-List (SL) – A list of shots, grouped by
their affiliation to a sequence, found by intellectual
annotation. A sequence represents a segment of
continuous action or location in a video.</p>
      <p>In order to access the video content in a graph-based
hierarchical structure, we create a directed graph to represent the
video's shots and sequences. The vertices belonging to a
sequence are aggregated to build a second level in the
hierarchy. Metadata created during the intellectual annotation
drives the aggregation.</p>
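      <p>A minimal sketch of this construction is given below, assuming the networkx library and a mapping from sequence ids to annotated shot indices (both our own choices, not the paper's implementation):</p>
      <preformat>
import networkx as nx

def build_sequence_graph(num_shots, sequences):
    # First level: one vertex per shot, edges along the temporal order.
    g1 = nx.DiGraph()
    g1.add_nodes_from(range(num_shots))
    g1.add_edges_from((v, v + 1) for v in range(num_shots - 1))
    # Second level: collapse annotated sequences into aggregated nodes
    # that keep their sub-graph as a node attribute.
    for seq_id, shots in sequences.items():
        sub = g1.subgraph(shots).copy()
        preds = {p for s in shots for p in g1.predecessors(s) if p not in shots}
        succs = {t for s in shots for t in g1.successors(s) if t not in shots}
        g1.remove_nodes_from(shots)
        g1.add_node(seq_id, subgraph=sub)
        g1.add_edges_from((p, seq_id) for p in preds)  # reconnect neighbours
        g1.add_edges_from((seq_id, t) for t in succs)
    return g1

# Shots 0..5; shots 1-3 were annotated as one sequence.
graph = build_sequence_graph(6, {"seq_a": [1, 2, 3]})
      </preformat>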
      <p>The sequence-graph algorithm yields a graph representing
all content sequences as aggregated nodes and the remaining
singular nodes, not belonging to any sequence, on the first
level. Inside each aggregated node, a sub-graph is created on
the second level, representing the chain of shots forming a
sequence.</p>
      <p>Procedure Similarity-Graph
input: list of Color-Similarity-Groups CSG, graph G1
output: sequence-graph with similarity sub-graphs G2
1.  for each aggregated vertex va from G1 do
2.      create new temporary graph Gt
3.      for each similarity group csg from CSG do
4.          if one or more sub-vertices v of va are in csg then
5.              add new group-vertex vg to Gt
6.          end if
7.      end for each
8.      for each sub-edge e from va do
9.          add new edge to Gt connecting the corresponding
            vertices of its source's group-vertex and its target's
            group-vertex, respectively the non-group-vertex if
            source or target is not part of a similarity group
10.     end for each
11.     calculate the strongly connected components of Gt
12.     for each strongly connected component scc from Gt do
13.         create new similarity-vertex vs
14.         for each shot-vertex v from scc do
15.             add v as sub-element to vs
16.             remove v and its out-edges and in-edges from va
17.             add v and its edges as sub-elements to vs
18.         end for each
19.     end for each
20.     remove all duplicate edges and increment the weight of the remaining edge e
21. end for each</p>
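      <p>The cycle-collapsing step can be sketched with networkx as well; condensation() replaces each strongly connected component by one vertex, which corresponds to the creation of similarity vertices (again a sketch under our own assumptions, not the paper's code):</p>
      <preformat>
import networkx as nx

def collapse_similarity_groups(sub, similarity_groups):
    # Map every shot onto its colour-similarity group (shots without a
    # group keep their own id).
    tmp = nx.DiGraph()
    for u, v in sub.edges():
        tmp.add_edge(similarity_groups.get(u, u), similarity_groups.get(v, v))
    # Recurring shots (shot/reverse-shot) form cycles, i.e. strongly
    # connected components with more than one vertex; condensation()
    # collapses each component into a single vertex.
    return nx.condensation(tmp)
      </preformat>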
      <p>
        Similarity-Graph-Algorithm
One important feature of videos from film and television is
the presence of recurring images. This happens especially
when interviews or dialogs are recorded and the same
individuals are shown several times. In terms of film grammar,
this is called the shot-/reverse-shot method (see Figure 4).
Resulting Graph Structure
The final resulting graph represents the video in a
hierarchical structure. On the first level, all sequences and all
standalone shots can be accessed. By selecting a sequence, all
shots and similarity groups inside the selected sequence can
be accessed. If a shot shows a similar image multiple times,
each instance of this image is aggregated into a group.
Recurring shots are recognizable by cyclic structures of the edges.
On selecting a similarity group, the individual instances of the
similar shots can be accessed. The results of the two
clustering steps and the final 3-layer graph are shown in Figure 5.
Figure 6 shows the visualization of a single layer as used in
the UI.
UI approaches with the purpose of addressing structures in
video content have been developed mainly in the fields of
film studies and in human-computer interaction (HCI). They
normally focus on certain key aspects like analysis,
description [4] or summarization [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] of content. From their
perspective, the temporal order of a video's single sequences is an
important bit of information and therefore one of the
fundamental principles of their modus operandi.
      </p>
      <p>By shifting the main focus to the video's structure, we
managed to design a user interface that makes it possible to
quickly get an overview of a whole file without losing any detail.
Graph-based User Interface
In order to avoid the issues reported by our user groups, as
described above, we decided to organize all available
information in a way that emphasizes the video's structure.
Richness of detail increases from top (overview) to bottom (all
details and metadata). The presented metadata types are
summarized in Table 2. The following interface description
refers to the layers presented in Figures 6 and 7.
I. Video player – The player can be used to examine the
single segments in any intended way. In order to provide
permanent availability, it remains at the top of the screen
when scrolling to the lower parts of the UI.</p>
      <p>II. Current graph – Its nodes represent either a single shot
group or a cluster of related groups. By using a simple
directed graph for the top level, we were able to display
all nodes in a familiar left-to-right order. Every node
contains a representative image sample and some basic
information on its content. The existence of child graphs is
color-coded (blue) on this level of detail.</p>
      <p>III. Collapsible container – Used to display a more
granular child graph belonging to a certain top-level node.
IV. Queue – Nodes can be transferred into a drag-and-drop
operated queue of cards that offer a more detailed view of
their content. Furthermore, the queue can be used to manage a
collection of shots or shot groups that can be watched
directly or exported for further use, e.g. in editing software.
V. Details view – Shows all data that is available for one of
the cards. It consists of several lines displaying key
frames, detected faces, spoken text (off-text) and text overlays.</p>
      <p>EVALUATION
We performed a first evaluation of our approach by using a
combination of baseline tests and questionnaires. To this end,
we designed a set of tasks comparable to those described by
our group of experts. A screenshot of the graph-based UI is
depicted in Figure 7. The content used for evaluation consists
of real television news programs, produced during the early
to mid-1990s and archived on VHS video tapes. The
actual test set was composed by randomly selecting 1377
minutes of this video material.</p>
      <p>Four expert users were asked to perform search tasks.
They were given short descriptions of 27 randomly picked
video sequences with durations between 5 seconds and 10
minutes. The task was to find the described sequences in the
corresponding video file and to write down the time codes of
the sequence boundaries. Search tasks like these are quite
comparable to the real-life work of video editors, because
video content in tape-based archives is only marginally
documented: manual content browsing in a video player or
non-linear editing (NLE) software is used to find sequences
of video content reusable in new video clips.</p>
      <p>For comparison, the search tasks were performed using
our graph-based user interface, VLC Media Player and
Adobe Premiere Pro (CS6). For each task, the time needed
for completion was recorded. Overall, 108 different search
operations were performed. Furthermore, differences in the
accuracy of the time codes were taken into account. With the
graph-based UI, the average duration per search task was
93 seconds. When searching with VLC (average: 122 s) and
Premiere Pro (average: 179 s), significantly more time was
needed (Figure 8). As a result, our graph-based solution
outperformed VLC and Premiere Pro: in VLC, 27.8% more
time was needed; searching in Premiere Pro took 48.5% more
time. One reason for the weak performance of Premiere Pro
could be its zoom function, which was heavily used by the
testers but led to longer search times.</p>
      <p>One disadvantage of the graph-based UI turned out to be the
fact that entities and events inside a video shot cannot be
isolated: they are bound to the boundaries of the surrounding
shot and cannot be exported independently. In terms of
perceiving the actual structure of the video, all users reported
gaining a deeper understanding when using our approach
than when using VLC or Premiere Pro.</p>
      <p>
        FUTURE WORK
The next step for the analysis and graph-based clustering will
be the substitution of the manual annotation of video
sequences by an automatic sequence segmentation algorithm.
Surveys on the state of the art in video segmentation
indicate that a multimodal fusion of the analysis results can be
used to cluster successive shots into video sequences. Most
approaches use visual similarity features, but as discussed in
[
        <xref ref-type="bibr" rid="ref5">7</xref>
        ], concepts and rules from the production of video content
can be useful to find sequences or scenes inside video
content.
      </p>
      <p>The graph-based user interface will be evaluated in
additional user tests, exploring whether its use is beneficial for
non-professional users as well. A second study will evaluate which
text-based metadata should be presented at the different
elements to meet the needs of the users. Currently, extensions
of the UI are under development to enable a sync function
that adapts the presented graph elements when the
current position in the video shifts to the next sequence. This
will give the UI a two-way interaction between the video
player and the graph structure.</p>
      <p>CONCLUSION
In this paper we presented our concept of a hierarchical
presentation of video items in a graph-based structure. We
described our framework, which incorporates video and audio
analysis, intellectual annotation and graph analysis to
construct a multi-layer structure for content consumption. Our
web-based UI shows how classical sequential content
browsing in videos can be extended to incorporate the inner
structures and relations of a video's sub-elements.</p>
      <p>ACKNOWLEDGMENTS
Parts of this work were accomplished in the research project
validAX funded by the German Federal Ministry of
Education and Research.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>[1] Adami, N., Benini, S. and Leonardi, R. An overview of video shot clustering and summarization techniques for mobile applications. In Proc. MobiMedia '06, ACM (2006), No. 27.</mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[3] Knauf, R., Kürsten, J., Kurze, A., Ritter, M., Berger, A., Heinich, S. and Eibl, M. Produce. annotate. archive. repurpose: accelerating the composition and metadata accumulation of TV content. In Proc. AIEMPro '11, ACM (2011), 30-36.</mixed-citation>
      </ref>
      <ref id="ref3">
<mixed-citation>[4] Korte, H. Einführung in die systematische Filmanalyse. Schmidt (1999), 40.</mixed-citation>
      </ref>
      <ref id="ref4">
<mixed-citation>[6] Lu, L. and Zhang, H.-J. Speaker change detection and tracking in real-time news broadcasting analysis. In Proc. MULTIMEDIA '02, ACM (2002), 602-610.</mixed-citation>
      </ref>
      <ref id="ref5">
<mixed-citation>[7] Rickert, M. and Eibl, M. A proposal for a taxonomy of semantic editing devices to support semantic classification. In Proc. RACS 2014, ACM (2014), 34-39.</mixed-citation>
      </ref>
      <ref id="ref6">
<mixed-citation>[8] Rickert, M. and Eibl, M. Evaluation of media analysis and information retrieval solutions for audio-visual content through their integration in realistic workflows of the broadcast industry. In Proc. RACS 2013, ACM Press (2013), 118-121.</mixed-citation>
      </ref>
      <ref id="ref7">
<mixed-citation>[9] Ritter, M. and Eibl, M. An Extensible Tool for the Annotation of Videos Using Segmentation and Tracking. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2011), 295-304.</mixed-citation>
      </ref>
      <ref id="ref8">
<mixed-citation>[10] Ritter, M. Optimierung von Algorithmen zur Videoanalyse. Chemnitz (2013), 1-336, esp. 119-144 and 187-213. ISBN 978-3-944640-09-9.</mixed-citation>
      </ref>
      <ref id="ref9">
<mixed-citation>[11] Heinich, S. Textdetektion und -extraktion mit gewichteter DCT und mehrwertiger Bildzerlegung. In Proc. WAM 2009, TU Chemnitz (2009), 151-162. ISBN 978-3-000278-58-7.</mixed-citation>
      </ref>
      <ref id="ref10">
<mixed-citation>[12] Manthey, R., Herms, R., Ritter, M., Storz, M. and Eibl, M. A Support Framework for Automated Video and Multimedia Workflows for Production and Archive. In Proc. HCI International 2013, Springer (2013), 336-341.</mixed-citation>
      </ref>
      <ref id="ref11">
<mixed-citation>[13] Del Fabro, M. and Böszörmenyi, L. State-of-the-art and future challenges in video scene detection: a survey. Multimedia Systems 19, 5, Springer (2013), 427-454.</mixed-citation>
      </ref>
      <ref id="ref12">
<mixed-citation>[14] Yeung, M.M. and Boon-Lock, Y. Time-constrained clustering for segmentation of video into story units. In Proc. 13th International Conference on Pattern Recognition, vol. 3, IEEE (1996), 375-380. doi: 10.1109/ICPR.1996.546973.</mixed-citation>
      </ref>
      <ref id="ref13">
<mixed-citation>[15] Ngo, C.-W., Ma, Y.-F. and Zhang, H.-J. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2, IEEE (2005), 296-305.</mixed-citation>
      </ref>
      <ref id="ref14">
<mixed-citation>[16] Hanjalic, A., Lagendijk, R.L. and Biemond, J. Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9, 4, IEEE (1999), 580-588.</mixed-citation>
      </ref>
      <ref id="ref15">
<mixed-citation>[17] Berger, A., Kürsten, J. and Eibl, M. Visual String of Reformulation. In Proc. HCI International, LNCS 5618, Springer (2009).</mixed-citation>
      </ref>
      <ref id="ref16">
<mixed-citation>[18] Berger, A. Design Thinking for Search User Interface Design. In Proc. EuroHCIR 2011, Newcastle (2011), 38-41.</mixed-citation>
      </ref>
      <ref id="ref17">
<mixed-citation>[19] Kwon, Y.-M., Song, C.-J. and Kim, I.-J. A new approach for high level video structuring. In Proc. Multimedia and Expo, ICME 2000, 773-776.</mixed-citation>
      </ref>
      <ref id="ref18">
<mixed-citation>[20] Wang, W. and Gao, W. Automatic segmentation of news items based on video and audio features. In Proc. Advances in Multimedia Information Processing, PCM 2001, LNCS 2195, 498-505.</mixed-citation>
      </ref>
      <ref id="ref19">
<mixed-citation>[21] Vendrig, J. and Worring, M. Systematic evaluation of logical story unit segmentation. IEEE Transactions on Multimedia 4, 4, IEEE (2002), 492-499.</mixed-citation>
      </ref>
      <ref id="ref20">
<mixed-citation>[22] Shi, J. and Malik, J. Normalized Cuts and Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8, IEEE (2000), 888-905.</mixed-citation>
      </ref>
      <ref id="ref21">
<mixed-citation>[23] Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H. and Trancoso, I. Multi-modal scene segmentation using scene transition graphs. In Proc. MM '09, the 17th ACM International Conference on Multimedia, ACM (2009), 665-668.</mixed-citation>
      </ref>
      <ref id="ref22">
<mixed-citation>[24] Xu, S., Feng, B., Ding, P. and Xu, B. Graph-based multi-modal scene detection for movie and teleplay. In Proc. Acoustics, Speech and Signal Processing (ICASSP) 2012, IEEE (2012), 1413-1416.</mixed-citation>
      </ref>
      <ref id="ref23">
<mixed-citation>[25] Porteous, J., Benini, S., Canini, L., Charles, F., Cavazza, M. and Leonardi, R. Interactive storytelling via video content recombination. In Proc. MM '10, the 18th ACM International Conference on Multimedia, ACM (2010), 1715-1718.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>