A Graph-Based Approach and Analysis Framework for Hierarchical Content Browsing

Markus Rickert, Benedikt Etzold, Maximilian Eibl
Technische Universität Chemnitz, Straße der Nationen 62, D-09111 Chemnitz, Germany
markus.rickert@cs.tu-chemnitz.de, benedikt.etzold@cs.tu-chemnitz.de, maximilian.eibl@cs.tu-chemnitz.de

"3rd International Workshop on Interactive Content Consumption at TVX'15, June 3rd, 2015, Brussels, Belgium. Copyright is held by the author(s)/owner(s)."

ABSTRACT
Systems for multimedia retrieval have been an object of scientific research for many years. When it comes to presenting results to the user, many solutions disregard the set of problems connected to content delivery. Especially the time-constrained results of video retrieval systems need a different visualization. In this paper we present our solution for hierarchical content browsing of video files. Our workflow covers the phases of ingest, transcoding, automatic analysis, intellectual annotation and data aggregation. We describe an algorithm for the graph-based analysis of the content structure of videos. Based on the requirements of professional users, we developed a user interface that provides access to retrieval results at different hierarchical abstraction levels.

Author Keywords
Content browsing, video analysis, video retrieval, graph-based analysis, visualization, algorithm, user interface

ACM Classification Keywords
E.1 [Data Structures]: Graphs and networks; H.5.1 [Information Interfaces and Presentation]: Multimedia; I.2.10 [Vision and Scene Understanding]: Video analysis

INTRODUCTION
Compared to other areas of information retrieval, the content browsing of audiovisual media bears special challenges. Videos are time-dependent. Usually the user's intention is to find an element inside a video depicting a certain semantic concept like a person, topic, location or event. When querying a video database, the returned result is either a complete video item or a single element inside a video item determined by its time position. Professional users are not mainly interested in finding only a single occurrence of the queried semantic concept. They want to gather the whole sequence related to their search query, e.g. to reuse it in a news report or for historical research. The user usually sees the retrieval result as a starting point for a further manual search process inside the video item, which is operated by using the playback and seek functions of the player software.

In this paper we present our approach to providing a hierarchical presentation of video items that supports professional users while browsing and consuming the content of a media retrieval system. Because of its primary focus on video content from television programs, this solution works best on video material edited in a post-production workflow. It is not supposed to be used on e.g. surveillance videos. Our framework has been developed to provide automatic and intellectual annotation for historical television recorded on video tapes. The digitized master copies and their metadata can be searched and displayed in a web-based user interface (UI). Video shots and sequences can be explored as a hierarchical structure in the UI. The system is in use in a pilot project by the media state authority of Saxony (Sächsische Landesmedienanstalt) in Germany.
USER REQUIREMENTS & EXISTING WORKFLOWS
Our use case focuses on user groups in professions that rely heavily on reviewing large amounts of video data on a daily basis, like journalists, editors or historians.

In a set of interviews we asked a group of experts to describe their daily work. We especially focused on those areas that deal with the examination of the results of archive queries. Other fields of interest were the process of querying, preferred software solutions and the planning of new reports or videos. Our findings were subsequently merged into an extensive workflow that was used for identifying different problem areas.

Altogether, we spoke to three experts from three different German TV stations, who all work in the field of TV journalism. Their similar statements and their reports on the workflows of other professionals and institutions give reason to believe that our workflow is representative of a significant part of this field of work. Conducting surveys and interviews [17, 18], we identified some of the main problems they face as part of their working routine:

- Metadata is often either fragmentary or missing completely. While standards or recommendations exist in most professions, they are usually ignored due to bottlenecks in time and personnel.
- Video data is normally stored in its final state, e.g. as a film that has already been edited in post-production. If a search query returns more than one result, users often receive a single file containing a queue of all relevant video files.
- In TV production, time pressure is always high because of narrow schedules and the need for instant coverage of current events.

Specific software solutions addressing these issues do not yet exist in professional scenarios. This leads to a highly inefficient workflow: precision rates are usually low because of the described storage modalities and the lack of precise metadata. Therefore, numerous files of comparatively large size have to be inspected in a short period of time.

Classical User Interfaces
The software that is used is normally designed either for the simple consumption of video content (e.g. VLC Media Player or Apple QuickTime) or for the tasks of professional post-production (e.g. Avid Media Composer or Adobe Premiere). Both approaches are based on a perspective that emphasizes the linear structure of the completed video during or after the process of editing. By showing an ordered sequence of single shots, they present the content in consideration of the editor's intention but not of the needs of an expert using a retrieval system.

Requirements
Based on these findings, we compiled a list of requirements that have to be met by a user interface to improve the user experience significantly:

- Metadata is usable for both video processing and visualization.
- Information can be displayed based on the video's structure.
- Richness of detail can be increased for single segments of the video.
- The video itself can be accessed through any piece of information displayed in the UI.
- Relevant segments of the video can be used in later steps of the user's workflow, e.g. editing.
FRAMEWORK
Our framework provides functionalities for audio and video analysis, manual annotation, data warehousing, retrieval and visualization. It uses specialized components for each aspect. The core "dispatcher" controls the analysis process, the allocation of work units and the data aggregation. As deduced from [8], the requirements for a scalable analysis system based on heterogeneous scientific algorithms in the field of audio and video analysis are complex. The framework is presented here in its complete workflow for the first time. Earlier publications covered only aspects of distinct components. A predecessor partial framework was presented in [3].

Our framework needs to support individual solutions, programmed in varying languages, based on different operating system environments and requesting various quantities of resources. Therefore, it runs in an environment of virtual machines on a cluster of five Intel Xeon dual-quad-core host servers. The main components were written in C# .NET and make use of a service-oriented architecture and web services. This provides a redundant and hardware-independent service, while supporting a variety of separate execution environments for each component. It also allows for a possible scale-out with additional hardware if needed.

The execution workflow for an individual video tape or file consists of five phases, as depicted in Figure 1. At each stage, the system is intended to reach a maximum of concurrency.

Figure 1: Framework workflow in its five phases.

I. Digitization and Transcoding
The very first step is an incoming inspection of each video tape and the generation of a unique identifier. Our ID system consists of a 12-byte block and can be represented and displayed for human reading as a combination of 12 hexadecimal digits in four segments plus a calculated check character (e.g. 0000-0074-0000-0026-Z). After the initial logging, the video tape is digitized with an automatic robot ingest system as described in [12]. It runs batch jobs in parallel on up to six tape players.

The resulting digital master file is encoded with the broadband IMX50 video codec and captured in an MXF container for archiving and data exchange. As defined by [8], we create proxy versions of the archive file by transcoding it. For automatic annotation, analysis and as a preview video for the web UI, we use an H.264 codec at level 4.1 wrapped in an MP4 container.
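To make the transcoding step concrete, the following is a minimal sketch of a proxy-generation call, assuming ffmpeg is available on the system path. The file names and the quality settings (preset, CRF, audio codec) are illustrative assumptions, since the text only specifies H.264 at level 4.1 in an MP4 container.

```python
import subprocess

def create_analysis_proxy(master_mxf: str, proxy_mp4: str) -> None:
    """Transcode the IMX50/MXF archive master into an H.264 level 4.1
    MP4 proxy for analysis and web preview (illustrative settings)."""
    subprocess.run([
        "ffmpeg", "-i", master_mxf,         # digital master from the ingest robot
        "-c:v", "libx264",                  # H.264 proxy codec, as stated above
        "-level:v", "4.1",                  # level 4.1
        "-preset", "medium", "-crf", "23",  # assumed quality settings
        "-c:a", "aac",                      # assumed audio codec for the MP4 container
        proxy_mp4,
    ], check=True)

create_analysis_proxy("0000-0074-0000-0026-Z.mxf", "0000-0074-0000-0026-Z.mp4")
```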
II. Automatic Analysis and Annotation
The created analysis proxy video is transferred to the analysis cluster. The dispatcher schedules the analysis of each video file as a sequence of consecutive analysis steps. For performance reasons, each component can be instantiated multiple times. In the common configuration, the system runs with up to 12 individual virtual machines. The analysis components are controlled by the dispatcher via web-service interfaces.

Shot detection component
The shot detection is the first component in the workflow. It provides a segmentation of the continuous video stream into parts of uninterrupted camera recordings (shots). The algorithms developed by [9] and [10] are based on calculating the cumulated error rate of individual motion vectors for each image block between two successive frames. The component's output is a list of metadata for every detected shot. Key frames of the shot are extracted for use in the UI and in successive components.
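The published shot detection of [9, 10] operates on block-wise motion vectors; as those algorithms are not reproduced here, the sketch below substitutes a much simpler block-wise frame difference as the cumulated error measure and flags a boundary whenever the mean error exceeds a threshold. OpenCV is assumed; the block size and threshold are placeholder values, not the tuned parameters of the original method.

```python
import cv2
import numpy as np

def detect_shots(path: str, block: int = 16, threshold: float = 30.0) -> list:
    """Very simplified stand-in for the motion-vector based shot detection:
    a shot boundary is reported when the mean block-wise difference
    between two successive frames exceeds a threshold."""
    cap = cv2.VideoCapture(path)
    boundaries, prev, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diff = np.abs(gray - prev)
            h, w = diff.shape
            # cumulate the error per image block, then average over all blocks
            blocks = diff[: h - h % block, : w - w % block]
            blocks = blocks.reshape(h // block, block, w // block, block)
            block_error = blocks.mean(axis=(1, 3))
            if block_error.mean() > threshold:
                boundaries.append(index)  # frame index of the detected cut
        prev, index = gray, index + 1
    cap.release()
    return boundaries
```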
Face detection component
The face detection component uses the key frames from the shot detection to mark bounding boxes around each detected face. The algorithm, developed by [10], is optimized for high precision and specialized for data corpora from local television broadcasts. Its result data is a set of metadata describing the bounding box around each detected face and a sample image for each detected face.

Text extraction component
The text extraction component detects areas of overlay text boxes within the video stream. The algorithm by [11] uses a weighted discrete cosine transform (DCT) to detect macroblocks by finding regions located in the medium frequency spectrum. By normalizing the eigenvalues, a mask is calculated which is used to separate the textbox from the rest of the image. For the text-to-character transformation, the software tesseract-ocr is used (https://code.google.com/p/tesseract-ocr/). The component creates key frame samples of the detected textboxes, metadata about the locations of the textboxes and the text extracted by OCR (optical character recognition).
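The weighted DCT and the eigenvalue normalization of [11] are more involved than can be shown here; the following sketch only illustrates the general pipeline: score 8x8 macroblocks by their medium-frequency DCT energy, derive a binary mask by thresholding, and hand the masked image to Tesseract. OpenCV and the pytesseract binding are assumed; the frequency band and the threshold are placeholders.

```python
import cv2
import numpy as np
import pytesseract  # Python binding for tesseract-ocr

def extract_overlay_text(keyframe_path: str) -> str:
    """Illustrative overlay-text pipeline: mid-frequency DCT energy per
    8x8 macroblock as a text-likelihood score, a binary mask from
    thresholding, OCR on the masked image. The weighted DCT and the
    eigenvalue normalization of [11] are not reproduced here."""
    gray = cv2.imread(keyframe_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    h, w = gray.shape
    score = np.zeros((h // 8, w // 8), np.float32)
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            coeffs = cv2.dct(gray[y:y + 8, x:x + 8])
            # energy of the medium frequency spectrum (placeholder band)
            score[y // 8, x // 8] = np.abs(coeffs[2:6, 2:6]).sum()
    mask = (score > score.mean() + score.std()).astype(np.uint8)
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
    masked = np.where(mask > 0, gray, 255).astype(np.uint8)  # blank out non-text areas
    return pytesseract.image_to_string(masked)
```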
Speech Recognition
The speech recognition component makes use of the speaker change recognition method described by [6] and extended by [3]. It provides data for the differentiation of individual voices and pauses. By applying Gaussian mixture models, individual speakers can be trained and recognized. The detected utterances of individual speakers are transferred to an automatic speech recognition (ASR) software. The resulting data provides not only the recognized words; it adds metadata about the time position and duration of the utterance and an ID code for the identification and re-recognition of the speaker.
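A minimal sketch of the GMM-based speaker recognition idea, assuming librosa for MFCC features and scikit-learn for the Gaussian mixture models; the feature configuration and the number of mixture components are assumptions, not the settings of [6] or [3]:

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path: str) -> np.ndarray:
    """MFCC feature matrix (frames x coefficients) for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T

def train_speaker_models(utterances: dict) -> dict:
    """Fit one GMM per speaker on the concatenated MFCCs of the
    speaker's training utterances ({speaker: [wav paths]})."""
    models = {}
    for speaker, paths in utterances.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[speaker] = GaussianMixture(n_components=16).fit(feats)
    return models

def recognize_speaker(models: dict, wav_path: str) -> str:
    """Assign the utterance to the model with the highest average log-likelihood."""
    feats = mfcc_features(wav_path)
    return max(models, key=lambda s: models[s].score(feats))
```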
III. Intellectual Annotation
This framework is not only used for the demonstration of our solutions. It is in productive use for archiving historical tape-based material. This constitutes the need for additional intellectual annotation, since today's automatic annotation can provide support, but it cannot entirely substitute the intellectual work of a human. Secondly, the manually annotated metadata is used as training and test sets for the development of new algorithms. Therefore, we collect metadata for each video tape in the form of classical intellectual annotation as it is already implemented in media archives.

Scene & Topic Annotation
We developed a web-based annotation tool for the intellectual annotation of the analyzed video files. To support the professional user, the tool makes use of the detected video shots. The video is presented in slices of camera shots. The video player repeats the current shot in a loop. This makes it easier for the user to fill out all input fields without constantly dealing with the player controls. When users are finished with a shot, they can jump to the next one. The user marks the boundaries of storyline sequences as collections of multiple shots and adds a variety of bibliographical metadata like the title and type of video content, topics, and subjects in terms of individuals, locations, institutions, things and creations (like art), as well as other metadata useful either for information retrieval or as development test data.

IV. Data Aggregation
In the past, we analyzed video assets only for isolated scientific experiments. To process large quantities of videos now, the integration of the results from different analysis algorithms becomes a key challenge. For our environment of use cases, a data-warehouse solution is needed that aggregates more than only the results of the video analysis. On the one hand, it needs to incorporate the metadata supplied by its sources, like production information and data from TV broadcasters. On the other hand, it has to provide its data as an export artifact which is compatible with the formats and conventions used by archiving facilities and institutes.

A special challenge was to find a scheme which complies with the way video content producers and archives structure their data, and which includes technical data like feature vectors and audiovisual classifications. Our selected database scheme is adapted from a common standard for video documentation developed for the German public television (REM, http://rmd.dra.de/remid.php?id=REM_APR_3). We combined mandatory and optional metadata classes with the goal of maintaining a maximum of compatibility.

V. Content Delivery and Visualization
For data exchange and archiving, the digital master file, the proxy files and the metadata are exported to an LTO tape library. Search and content access for the user are provided by a web server. The user interface is used for web-based intellectual annotation, for controlling the analysis process, for information retrieval and for content browsing. The UI is able to handle multiple tenants and has a scalable interface for different display resolutions and devices. Each function runs in its own web app.

GRAPH-BASED VIDEO CLUSTERING
During the analysis and automatic annotation we extract segments of camera shots from the video stream. This shot segmentation is helpful for content browsing, but it suffers from over-segmentation. The structure is too detailed for the visualization of the actions inside a video. The user needs to be able to search for scenes or sequences as basic units.

Different approaches to clustering or grouping related shots have been published. A detailed survey of the field of video segmentation is given by [13]. A common strategy in many clustering approaches is to find structures and similarities in the given video. The similarity measurement can be based on the classification of e.g. motion vectors, dominant color, edge histogram and editing tempo. By calculating the similarity of consecutive shots, groups can be identified. "Overlapping links", introduced by [16], was one of the early strategies to find structures inside videos. It was extended by [19, 20]. The algorithm can cluster similar shots and the shots lying in between into logical story units (LSUs) [21].

Our solution was inspired by overlapping links, the concept of a scene transition graph (STG) [14, 22] and the scene detection solution published by [15]. These approaches are still the subject of current publications and optimizations like [23, 24, 25]. Shots are represented as nodes, transitions as edges. Shots with a high similarity are clustered into group nodes. This process leads to a digraph with cycles.

Data Structure
Our proposed solution is derived from the concept of shot transition graphs. We use a weighted directed graph for the representation of hierarchical sequence structures in a video. Edges represent transitions between distinct shots or sequences. Nodes represent single shots or sub-segments with a nested graph of shots inside. See Table 1.

Vs – Singular Node: a single continuous camera shot.
Es – Singular Edge: a directed edge between two Singular Nodes (Vs) representing the transition from a camera shot to its successor in the sequence of the video.
Va – Aggregated Node: a group of Singular or Aggregated Nodes forming a sequence or sub-graph.
Ea – Aggregated Edge: a directed edge between two Aggregated Nodes (Va) or between a Singular Node and an Aggregated Node. It represents a set of interrelated Singular Nodes, respectively a sub-graph containing a scene of the video.
C – Color-Similarity-Group: a list of shots, grouped by their visual similarity. The similarity is measured by a combination of the MPEG-7 descriptors Edge Histogram (EHD) and Color Layout (CLD) [10, pp. 169].
Sq – Sequence-List: a list of shots, grouped by their affiliation to a sequence, found by intellectual annotation. A sequence represents a segment of continuous action or location in a video.

Table 1: Data structures.

Es: duration of the transition; type of transition (cut, wipe, dissolve, fade), as described in the taxonomy by [7].
Vs: number of the shot; times of start, duration and end of the shot; extracted keyframes of the first and last frame; extracted keyframes from the face detection; data from the text extraction.
Ea: weight.
Va: a representative keyframe; start time of the earliest sub-element; end time of the latest sub-element; metadata of the speech recognition; annotation (topic, location, subjects, individuals etc.).

Table 2: Metadata available in the data structures.

Sequence-Graph Algorithm
In order to access the video content in a graph-based hierarchical structure, we create a directed graph to represent the video's shots and sequences. The vertices belonging to a sequence are aggregated to build a second level in the hierarchy. The metadata created during the intellectual annotation drives this aggregation.

Procedure Sequence-Graph
input: list of detected shot boundaries and transitions Sh, list of sequences Sq.
output: Sequence-Graph G1(Ea, Va)
1.  for each detected shot and transition sh_i from Sh do
2.      add new vertex Vs_i to G1
3.      add new edge Es_i to G1 connecting Vs_i and Vs_i+1
4.  end for each
5.  for each sequence sq_j from Sq do
6.      create new Va_j in G1
7.  end for each
8.  for each Vs_i from G1 do
9.      if Vs_i belongs to sequence sq_j then
10.         remove Vs_i and its out-edges and in-edges from G1
11.         add Vs_i and its edges as sub-elements to Va_j
12.     end if
13. end for each
14. for each Es_i removed from G1 do
15.     add new edge Ea_i in G1 connecting Va_j with the predecessors resp. successors of Vs_i
16. end for each
17. for each Ea_k from G1 do
18.     if more than one Ea exists with the same source vertex and the same target vertex as Ea_k then
19.         remove all duplicates and increment the weight of Ea_k
20.     end if
21. end for each

Figure 2: Procedure to create a Sequence-Graph.

The resulting Sequence-Graph (algorithm in Figure 2) represents all content sequences as aggregated nodes, together with the remaining singular nodes not belonging to a sequence, on the first level. Inside each aggregated node a sub-graph is created on the second level, representing the chain of shots forming a sequence.

Figure 3: Visualization of vertices and edges of the Sequence-Graph.
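The procedure of Figure 2 can be read as ordinary graph code. The sketch below mirrors its three steps with networkx: build the chain of singular shot nodes, collapse every intellectually annotated sequence into an aggregated node that keeps its inner chain as a sub-graph, and merge duplicate aggregated edges by incrementing their weight. The data types are simplified stand-ins for the structures of Table 1.

```python
import networkx as nx

def build_sequence_graph(shots: list, sequences: dict) -> nx.DiGraph:
    """Sketch of the Sequence-Graph procedure (Figure 2): shots form a
    chain of singular nodes; shots annotated as one sequence are collapsed
    into an aggregated node whose 'subgraph' attribute holds the inner chain."""
    g1 = nx.DiGraph()
    g1.add_nodes_from(shots, kind="singular")
    g1.add_edges_from(zip(shots, shots[1:]), weight=1)  # Es chain

    for seq_id, members in sequences.items():
        sub = g1.subgraph(members).copy()               # inner chain of the sequence
        g1.add_node(seq_id, kind="aggregated", subgraph=sub)
        for shot in members:
            # reconnect outside neighbours to the aggregated node (Ea)
            for pred in list(g1.predecessors(shot)):
                if pred not in members:
                    _add_or_weight(g1, pred, seq_id)
            for succ in list(g1.successors(shot)):
                if succ not in members:
                    _add_or_weight(g1, seq_id, succ)
            g1.remove_node(shot)
    return g1

def _add_or_weight(g: nx.DiGraph, u, v) -> None:
    """Merge duplicate aggregated edges by incrementing their weight."""
    if g.has_edge(u, v):
        g[u][v]["weight"] += 1
    else:
        g.add_edge(u, v, weight=1)

# Example: six shots, shots 2-4 were annotated as one sequence.
g = build_sequence_graph([1, 2, 3, 4, 5, 6], {"seq-A": {2, 3, 4}})
```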
Similarity-Graph Algorithm
One important feature of videos from film and television is the presence of recurring images. This happens especially when interviews or dialogs are recorded and the same individuals are shown several times. In terms of film grammar this is called the shot/reverse-shot method. See Figure 4.

Procedure Similarity-Graph
input: list of Color-Similarity-Groups C, graph G1
output: Sequence-Graph with similarity sub-graphs G2
1.  for each Va_i from G1 do
2.      create new temporary graph Gt_i
3.      for each similarity group c_j from C do
4.          if one or more sub-vertices Vs of Va_i are in c_j then
5.              add new group vertex Va_j to Gt_i
6.          end if
7.      end for each
8.      for each sub-vertex Vs of Va_i not in any c_j do
9.          add new non-group vertex Vs_j to Gt_i
10.     end for each
11.     for each sub-edge Es ∪ Ea from Va_i do
12.         add new edge Ea_j to Gt_i connecting the group vertex Va corresponding to its source and the group vertex Va corresponding to its target, respectively the non-group vertex Vs if source or target is not part of a similarity group c_j
13.     end for each
14.     calculate the strongly connected components of Gt_i
15.     for each strongly connected component scc_k from Gt_i do
16.         create new similarity vertex Va_k as a sub-element in Va_i
17.         for each shot vertex Vs_l from Va_i do
18.             if Vs_l is a member of scc_k then
19.                 remove Vs_l and its out-edges and in-edges from Va_i
20.                 add Vs_l and its edges as sub-elements to Va_k
21.             end if
22.         end for each
23.         for each Es_m removed from Va_i do
24.             add new edge Ea_m in Va_i connecting Va_k with its predecessors resp. successors
25.         end for each
26.     end for each
27.     for each edge Ea_m from Va_i do
28.         if more than one Ea exists with the same source vertex and the same target vertex as Ea_m then
29.             remove all duplicates and increment the weight of Ea_m
30.         end if
31.     end for each
32. end for each

Figure 4: Similarity-Graph procedure.

Resulting Graph Structure
The final resulting graph represents the video in a hierarchical structure. On the first level, all sequences and all standalone shots can be accessed. By selecting a sequence, all shots and similarity groups inside the selected sequence can be accessed. If a shot shows a similar image multiple times, each instance of this image is aggregated into a group. Recurring shots are recognizable by cyclic structures of the edges. On selecting a similarity group, the individual instances of the similar shots can be accessed. The results of the two clustering steps and the final three-layer graph are shown in Figure 5. Figure 6 shows the visualization of a single layer as used in the UI.

Figure 5: G1 (Sequence-Graph), G2 (Similarity-Graph and Sequence-Graph).
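The heart of the similarity-graph procedure is the strongly-connected-components step (lines 14-15 of Figure 4): projecting the shot chain onto its color-similarity groups turns recurring groups into cycles, and each strongly connected component then becomes one similarity vertex. A minimal networkx sketch of just this step, again with simplified stand-in types:

```python
import networkx as nx

def similarity_subgraphs(shot_chain: list, groups: dict) -> list:
    """Sketch of the SCC step of the Similarity-Graph procedure: project
    the shot chain onto its color-similarity groups (recurring groups
    create cycles) and return each strongly connected component as one
    similarity group of shots. 'groups' maps a shot to its group id."""
    gt = nx.DiGraph()
    label = lambda shot: groups.get(shot, shot)  # group vertex or plain shot vertex
    for a, b in zip(shot_chain, shot_chain[1:]):
        gt.add_edge(label(a), label(b))
    components = []
    for scc in nx.strongly_connected_components(gt):
        shots = {s for s in shot_chain if label(s) in scc}
        if len(scc) > 1 or len(shots) > 1:       # keep only real recurrences
            components.append(shots)
    return components

# Shot/reverse-shot example: shots 1 and 3 show speaker A, shots 2 and 4 speaker B.
print(similarity_subgraphs([1, 2, 3, 4, 5], {1: "A", 3: "A", 2: "B", 4: "B"}))
# -> [{1, 2, 3, 4}]  the shots of the dialog collapse into one similarity vertex
```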
GRAPH-BASED USER INTERFACE
UI approaches with the purpose of addressing structures in video content have been developed mainly in the fields of film studies and human-computer interaction (HCI). They normally focus on certain key aspects like the analysis, description [4] or summarization [1] of content. From their perspective, the temporal order of a video's single sequences is an important piece of information and therefore one of the fundamental principles of their modus operandi.

By shifting the main focus to the video's structure, we managed to design a user interface that makes it possible to quickly overlook a whole file without losing any detail. A screenshot of the graph-based UI is depicted in Figure 7.

In order to avoid the issues reported by our user groups, as described above, we decided to organize all available information in a way that emphasizes the video's structure. The richness of detail increases from top (overview) to bottom (all details and metadata). The presented metadata types are summarized in Table 2. The following interface description is connected to the layers presented in Figures 6 and 7.

I. Video player – the player can be used to examine the single segments in any intended way. In order to provide permanent availability, it remains at the top of the screen when scrolling to the lower parts of the UI.
II. Current graph – its nodes represent either a single shot group or a cluster of related groups. By using a simple directed graph for the top level, we were able to display all nodes in a familiar left-to-right order. Every node contains a representative image sample and some basic information on its content. The existence of child graphs is color-coded (blue) on this level of detail.
III. Collapsible container – used to display a more granular child graph belonging to a certain top-level node.
IV. Queue – nodes can be transferred into a drag-and-drop operated queue of cards that offer a more detailed view of their content. Furthermore, they can be used to manage a collection of shots or shot groups that can be watched directly or exported for further use, e.g. in editing software.
V. Details view – shows all data that is available for one of the cards. It consists of several lines displaying key frames, detected faces, off-text (voice-over speech) and text overlays.

Figure 6: Multilayer view of a graph-based UI.
Figure 7: Schematic view of the UI.

EVALUATION
We performed a first evaluation of our approach by using a combination of baseline tests and questionnaires. For this, we designed a set of tasks comparable to those described by our group of experts. The content used for the evaluation consists of real television news programs, produced during the early to mid-1990s and archived on VHS video tapes. The actual test set was composed by randomly selecting 1377 minutes of this video material.

Four expert users were asked to perform search tasks. They were given short descriptions of 27 randomly picked video sequences with durations between 5 seconds and 10 minutes. The task was to find the described sequences in the corresponding video file and to write down the time codes of the sequence boundaries. Search tasks like these are quite comparable to the real-life work of video editors, because video content in tape-based archives is only marginally documented. Manual content browsing in a video player or non-linear editing software (NLE) is used to find sequences of video content reusable in new video clips.

For comparison, the search tasks were performed using our graph-based user interface, VLC Media Player and Adobe Premiere Pro (CS6). For each task the time needed for completion was recorded. Overall, 108 different search operations were performed. Furthermore, differences in the accuracy of the time codes were taken into account. With the graph-based UI, the average duration per search task was 93 seconds. When searching with VLC (average: 122 s) and Premiere Pro (average: 179 s), significantly more time was needed (Figure 8). As a result, our graph-based solution outperformed VLC and Premiere Pro: searching took 27.8% more time in VLC and 48.5% more time in Premiere Pro. One reason for the weak performance of Premiere Pro could be its zoom function. It was heavily used by the testers, but led to longer search times.

Figure 8: Evaluation results (average search time per task).
Graph-based UI: 93 s
VLC Media Player: 122 s
Premiere Pro CS6: 179 s

One disadvantage of the graph-based UI turned out to be the fact that entities and events inside a video shot cannot be isolated. They are bound to the boundaries of the surrounding shot and cannot be exported independently. In terms of perceiving the actual structure of the video, all users reported gaining a deeper understanding when using our approach than when using VLC or Premiere Pro.

FUTURE WORK
The next step for the analysis and graph-based clustering will be the substitution of the manual annotation of video sequences by an automatic sequence segmentation algorithm. Surveys of the state of the art in video segmentation indicate that a multimodal fusion of the analysis results can be used to cluster successive shots into video sequences. Most approaches use visual similarity features. But as discussed in [7], concepts and rules from the production of video content can also be useful to find sequences or scenes inside video content.

The graph-based user interface will be evaluated in additional user tests, exploring whether its use is beneficial for non-professional users as well. A second study will evaluate which text-based metadata should be presented at the different elements to meet the needs of the users. Currently, extensions of the UI are under development to enable a sync function. It will allow adapting the presented graph elements when the current position in the video shifts to the next sequence. This will give the UI a two-sided interaction between the video player and the graph structure.

CONCLUSION
In this paper we presented our concept of a hierarchical presentation of video items in a graph-based structure. We described our framework, which incorporates video and audio analysis, intellectual annotation and graph analysis to construct a multi-layer structure for content consumption. Our web-based UI shows how classical sequential content browsing in videos can be extended to incorporate the inner structures and relations of the video's sub-elements.

ACKNOWLEDGMENTS
Parts of this work were accomplished in the research project validAX, funded by the German Federal Ministry of Education and Research.
REFERENCES
[1] Adami, N., Benini, S., Leonardi, R. An overview of video shot clustering and summarization techniques for mobile applications. In Proc. MobiMedia '06, ACM (2006), No. 27.
[3] Knauf, R., Kürsten, J., Kurze, A., Ritter, M., Berger, A., Heinich, S., Eibl, M. Produce. annotate. archive. repurpose --: accelerating the composition and metadata accumulation of TV content. In Proc. AIEMPro '11, ACM (2011), 30-36.
[4] Korte, H. Einführung in die systematische Filmanalyse. Schmidt (1999), 40.
[6] Lu, L. and Zhang, H.-J. Speaker change detection and tracking in real-time news broadcasting analysis. In Proc. MULTIMEDIA '02, ACM (2002), 602-610.
[7] Rickert, M. and Eibl, M. A proposal for a taxonomy of semantic editing devices to support semantic classification. In Proc. RACS 2014, ACM (2014), 34-39.
[8] Rickert, M. and Eibl, M. Evaluation of media analysis and information retrieval solutions for audio-visual content through their integration in realistic workflows of the broadcast industry. In Proc. RACS 2013, ACM Press (2013), 118-121.
[9] Ritter, M. and Eibl, M. An extensible tool for the annotation of videos using segmentation and tracking. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2011), 295-304.
[10] Ritter, M. Optimierung von Algorithmen zur Videoanalyse. Chemnitz (2013), 1-336, esp. 119-144, 187-213. ISBN 978-3-944640-09-9.
[11] Heinich, S. Textdetektion und -extraktion mit gewichteter DCT und mehrwertiger Bildzerlegung. In Proc. WAM 2009, TU Chemnitz (2009), 151-162. ISBN 978-3-000278-58-7.
[12] Manthey, R., Herms, R., Ritter, M., Storz, M., Eibl, M. A support framework for automated video and multimedia workflows for production and archive. In Proc. HCI International 2013, Springer (2013), 336-341.
[13] Del Fabro, M., Böszörmenyi, L. State-of-the-art and future challenges in video scene detection: a survey. Multimedia Systems 19, 5, Springer (2013), 427-454.
[14] Yeung, M.M. and Yeo, B.-L. Time-constrained clustering for segmentation of video into story units. In Proc. 13th International Conference on Pattern Recognition, IEEE (1996), 375-380 vol. 3. doi:10.1109/ICPR.1996.546973.
[15] Ngo, C.-W., Ma, Y.-F., Zhang, H.-J. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2, IEEE (2005), 296-305.
[16] Hanjalic, A., Lagendijk, R.L., Biemond, J. Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9, 4, IEEE (1999), 580-588.
[17] Berger, A., Kürsten, J., Eibl, M. Visual string of reformulation. In Proc. HCI International, Springer (2009), LNCS 5618.
[18] Berger, A. Design thinking for search user interface design. In Proc. EuroHCIR 2011, Newcastle (2011), 38-41.
[19] Kwon, Y.-M., Song, C.-J. and Kim, I.-J. A new approach for high level video structuring. In Proc. IEEE International Conference on Multimedia and Expo, ICME 2000 (2000), 773-776.
[20] Wang, W. and Gao, W. Automatic segmentation of news items based on video and audio features. In Proc. Advances in Multimedia Information Processing, PCM 2001, LNCS 2195 (2001), 498-505.
[21] Vendrig, J. and Worring, M. Systematic evaluation of logical story unit segmentation. IEEE Transactions on Multimedia 4, 4, IEEE (2002), 492-499.
[22] Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8, IEEE (2000), 888-905.
[23] Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H. and Trancoso, I. Multi-modal scene segmentation using scene transition graphs. In Proc. MM '09, 17th ACM International Conference on Multimedia, ACM (2009), 665-668.
[24] Xu, S., Feng, B., Ding, P. and Xu, B. Graph-based multi-modal scene detection for movie and teleplay. In Proc. Acoustics, Speech and Signal Processing (ICASSP) 2012, IEEE (2012), 1413-1416.
[25] Porteous, J., Benini, S., Canini, L., Charles, F., Cavazza, M. and Leonardi, R. Interactive storytelling via video content recombination. In Proc. MM '10, 18th ACM International Conference on Multimedia, ACM (2010), 1715-1718.