A Graph-Based Approach and Analysis Framework for Hierarchical Content Browsing

Markus Rickert, Benedikt Etzold, Maximilian Eibl
Technische Universität Chemnitz, Straße der Nationen 62, D-09111 Chemnitz, Germany
markus.rickert@cs.tu-chemnitz.de, benedikt.etzold@cs.tu-chemnitz.de, maximilian.eibl@cs.tu-chemnitz.de

"3rd International Workshop on Interactive Content Consumption at TVX'15, June 3rd, 2015, Brussels, Belgium. Copyright is held by the author(s)/owner(s)."

ABSTRACT
Systems for multimedia retrieval have been an object of scientific research for many years. When it comes to presenting results to the user, many solutions disregard the set of problems connected to content delivery. Especially the time-constrained results of video retrieval systems need a different visualization. In this paper we present our solution for hierarchical content browsing of video files. Our workflow covers the phases of ingest, transcoding, automatic analysis, intellectual annotation and data aggregation. We describe an algorithm for the graph-based analysis of the content structure of videos. Based on the requirements of professional users, we developed a user interface that provides access to retrieval results at different hierarchical abstraction levels.

Author Keywords
Content browsing, video analysis, video retrieval, graph-based analysis, visualization, algorithm, user interface

ACM Classification Keywords
E.1 [Data Structures]: Graphs and networks; H.5.1 [Information Interfaces and Presentation]: Multimedia; I.2.10 [Vision and Scene Understanding]: Video analysis

INTRODUCTION
Compared to other areas of information retrieval, the content browsing of audiovisual media bears special challenges. Videos are time-dependent. Usually the user's intention is to find an element inside a video depicting a certain semantic concept like a person, topic, location or event. When querying a video database, the returned result is either a complete video item or a single element inside a video item determined by its time position. Professional users are not mainly interested in finding only a single occurrence of the queried semantic concept. They want to gather the whole sequence related to their search query, e.g. to reuse it in a news report or for historical research. The user usually sees the retrieval result as a starting point for a further manual search process inside the video item, which is operated by using the playback and seek functions of the player software.

In this paper we present our approach to providing a hierarchical presentation of video items that supports professional users while browsing and consuming the content of a media retrieval system. Because of its primary focus on video content from television programs, this solution works best on video material edited in a post-production workflow. It is not supposed to be used on e.g. surveillance videos. Our framework has been developed to provide automatic and intellectual annotation for historical television recorded on video tapes. The digitized master copies and their metadata can be searched and displayed in a web-based user interface (UI). Video shots and sequences can be explored as a hierarchical structure in the UI. The system is in use in a pilot project by the media state authority of Saxony (Sächsische Landesmedienanstalt) in Germany.
USER REQUIREMENTS & EXISTING WORKFLOWS
Our use case focuses on user groups in professions that rely heavily on reviewing large amounts of video data on a daily basis, like journalists, editors or historians.

In a set of interviews we asked a group of experts to describe their daily work. We especially focused on those areas that deal with the examination of the results of archive queries. Other fields of interest were the process of querying, preferred software solutions and the planning of new reports or videos. Our findings were subsequently merged into an extensive workflow that was used for identifying different problem areas.

Altogether, we spoke to three experts from three different German TV stations, who all work in the field of TV journalism. Their similar statements and their reports on the workflows of other professionals and institutions give reason to believe that our workflow is representative of a significant part of this field of work. Conducting surveys and interviews [17, 18], we identified some of the main problems they face as part of their working routine:

- Metadata is often either fragmentary or missing completely. While standards or recommendations exist in most professions, they are usually ignored due to bottlenecks in time and personnel.
- Video data is normally stored in its final state, e.g. as a film that has already been edited in post-production. If a search query returns more than one result, users often receive a single file containing a queue of all relevant video files.
- In TV production, time pressure is always high because of narrow schedules and the need for instant coverage of current events.

Specific software solutions addressing these issues do not yet exist in professional scenarios. This leads to a highly inefficient workflow: precision rates are usually low because of the described storage modalities and the lack of precise metadata. Therefore, numerous files of comparatively large size have to be inspected in a short period of time.

Classical User Interfaces
The software that is used is normally designed either for the simple consumption of video content (e.g. VLC Media Player or Apple QuickTime) or for the tasks of professional post-production (e.g. Avid Media Composer or Adobe Premiere). Both approaches are based on a perspective that emphasizes the linear structure of the completed video during or after the process of editing. By showing an ordered sequence of single shots, they present the content in consideration of the editor's intention but not of the needs of an expert using a retrieval system.

Requirements
Based on these findings, we compiled a list of requirements that have to be met by a user interface to improve the user experience significantly:

- Metadata is usable for both video processing and visualization.
- Information can be displayed based on the video's structure.
- Richness of detail can be increased for single segments of the video.
- The video itself can be accessed through any piece of information displayed in the UI.
- Relevant segments of the video can be used in later steps of the user's workflow, e.g. editing.
FRAMEWORK
Our framework provides functionalities for audio and video analysis, manual annotation, data warehousing, retrieval and visualization. It uses specialized components for each aspect. The core "dispatcher" controls the analysis process, the allocation of work units and the data aggregation. As deduced from [8], the requirements for a scalable analysis system based on heterogeneous scientific algorithms in the field of audio and video analysis are complex. The framework is presented here in its complete workflow for the first time. Earlier publications covered only aspects of distinct components. A predecessor partial framework was presented in [3].

Our framework needs to support individual solutions, programmed in varying languages, based on different operating system environments and requesting various quantities of resources. Therefore, it runs in an environment of virtual machines on a cluster of five Intel Xeon dual-quad-core host servers. The main components were written in C# .NET and make use of a service-oriented architecture and web services. This provides a redundant and hardware-independent service, while supporting a variety of separate execution environments for each component. It also allows for a possible scale-out with additional hardware if needed.

The execution workflow for an individual video tape or file consists of five phases, as depicted in Figure 1. At each stage, the system is intended to reach a maximum of concurrency.

Figure 1: Framework workflow in its five phases.

I. Digitization and Transcoding
The very first step is an incoming inspection of each video tape and the generation of a unique identifier. Our ID system consists of a 12-byte block and can be represented and displayed for human reading as a combination of 12 hexadecimal digits in four segments plus a calculated check character (e.g. 0000-0074-0000-0026-Z). After the initial logging, the video tape is digitized with an automatic robot ingest system as described in [12]. It runs batch jobs in parallel on up to six tape players.

The resulting digital master file is encoded with the broadband IMX50 video codec and captured in an MXF container for archiving and data exchange. As defined by [8], we create proxy versions of the archive file by transcoding it. For automatic annotation, analysis and as a preview video for the web UI, we use an H.264 codec at level 4.1 wrapped in an MP4 container.
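To make the transcoding step concrete, the following is a minimal sketch of a proxy-generation call, assuming ffmpeg is available on the system path. The file names and the quality settings (preset, CRF, audio codec) are illustrative assumptions, since the text only specifies H.264 at level 4.1 in an MP4 container.

```python
import subprocess

def create_analysis_proxy(master_mxf: str, proxy_mp4: str) -> None:
    """Transcode the IMX50/MXF archive master into an H.264 level 4.1
    MP4 proxy for analysis and web preview (illustrative settings)."""
    subprocess.run([
        "ffmpeg", "-i", master_mxf,         # digital master from the ingest robot
        "-c:v", "libx264",                  # H.264 proxy codec, as stated above
        "-level:v", "4.1",                  # level 4.1
        "-preset", "medium", "-crf", "23",  # assumed quality settings
        "-c:a", "aac",                      # assumed audio codec for the MP4 container
        proxy_mp4,
    ], check=True)

create_analysis_proxy("0000-0074-0000-0026-Z.mxf", "0000-0074-0000-0026-Z.mp4")
```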
II. Automatic Analysis and Annotation
The created analysis proxy video is transferred to the analysis cluster. The dispatcher schedules the analysis of each video file as a sequence of consecutive analysis steps. For performance reasons, each component can be instantiated multiple times. In the common configuration, the system runs with up to 12 individual virtual machines. The analysis components are controlled by the dispatcher via web-service interfaces.

Shot detection component
The shot detection is the first component in the workflow. It provides a segmentation of the continuous video stream into parts of uninterrupted camera recordings (shots). The algorithms developed by [9] and [10] are based on calculating the cumulated error rate of individual motion vectors for each image block between two successive frames. The component's output is a list of metadata for every detected shot. Key frames of the shot are extracted for use in the UI and in successive components.
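The published shot detection of [9, 10] operates on block-wise motion vectors; as those algorithms are not reproduced here, the sketch below substitutes a much simpler block-wise frame difference as the cumulated error measure and flags a boundary whenever the mean error exceeds a threshold. OpenCV is assumed; the block size and threshold are placeholder values, not the tuned parameters of the original method.

```python
import cv2
import numpy as np

def detect_shots(path: str, block: int = 16, threshold: float = 30.0) -> list:
    """Very simplified stand-in for the motion-vector based shot detection:
    a shot boundary is reported when the mean block-wise difference
    between two successive frames exceeds a threshold."""
    cap = cv2.VideoCapture(path)
    boundaries, prev, index = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            diff = np.abs(gray - prev)
            h, w = diff.shape
            # cumulate the error per image block, then average over all blocks
            blocks = diff[: h - h % block, : w - w % block]
            blocks = blocks.reshape(h // block, block, w // block, block)
            block_error = blocks.mean(axis=(1, 3))
            if block_error.mean() > threshold:
                boundaries.append(index)  # frame index of the detected cut
        prev, index = gray, index + 1
    cap.release()
    return boundaries
```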
Face detection component
The face detection component uses the key frames from the shot detection to mark bounding boxes around each detected face. The algorithm, developed by [10], is optimized for high precision and specialized for data corpora from local television broadcasts. Its result data is a set of metadata describing the bounding box around each detected face and a sample image for each detected face.

Text extraction component
The text extraction component detects areas of overlay text boxes within the video stream. The algorithm by [11] uses a weighted discrete cosine transform (DCT) to detect macroblocks by finding regions located in the medium frequency spectrum. By normalizing the eigenvalues, a mask is calculated which is used to separate the textbox from the rest of the image. For the text-to-character transformation, the software tesseract-ocr is used (https://code.google.com/p/tesseract-ocr/). The component creates key frame samples of the detected textboxes, metadata about the locations of the textboxes and the text extracted by OCR (optical character recognition).
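The weighted DCT and the eigenvalue normalization of [11] are more involved than can be shown here; the following sketch only illustrates the general pipeline: score 8x8 macroblocks by their medium-frequency DCT energy, derive a binary mask by thresholding, and hand the masked image to Tesseract. OpenCV and the pytesseract binding are assumed; the frequency band and the threshold are placeholders.

```python
import cv2
import numpy as np
import pytesseract  # Python binding for tesseract-ocr

def extract_overlay_text(keyframe_path: str) -> str:
    """Illustrative overlay-text pipeline: mid-frequency DCT energy per
    8x8 macroblock as a text-likelihood score, a binary mask from
    thresholding, OCR on the masked image. The weighted DCT and the
    eigenvalue normalization of [11] are not reproduced here."""
    gray = cv2.imread(keyframe_path, cv2.IMREAD_GRAYSCALE).astype(np.float32)
    h, w = gray.shape
    score = np.zeros((h // 8, w // 8), np.float32)
    for y in range(0, h - 7, 8):
        for x in range(0, w - 7, 8):
            coeffs = cv2.dct(gray[y:y + 8, x:x + 8])
            # energy of the medium frequency spectrum (placeholder band)
            score[y // 8, x // 8] = np.abs(coeffs[2:6, 2:6]).sum()
    mask = (score > score.mean() + score.std()).astype(np.uint8)
    mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
    masked = np.where(mask > 0, gray, 255).astype(np.uint8)  # blank out non-text areas
    return pytesseract.image_to_string(masked)
```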
Speech Recognition
The speech recognition component makes use of the speaker change recognition method described by [6] and extended by [3]. It provides data for the differentiation of individual voices and pauses. By applying Gaussian mixture models, individual speakers can be trained and recognized. The detected utterances of individual speakers are transferred to an automatic speech recognition (ASR) software. The resulting data provides not only the recognized words; it adds metadata about the time position and duration of the utterance and an ID code for the identification and re-recognition of the speaker.
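A minimal sketch of the GMM-based speaker recognition idea, assuming librosa for MFCC features and scikit-learn for the Gaussian mixture models; the feature configuration and the number of mixture components are assumptions, not the settings of [6] or [3]:

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(wav_path: str) -> np.ndarray:
    """MFCC feature matrix (frames x coefficients) for one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13).T

def train_speaker_models(utterances: dict) -> dict:
    """Fit one GMM per speaker on the concatenated MFCCs of the
    speaker's training utterances ({speaker: [wav paths]})."""
    models = {}
    for speaker, paths in utterances.items():
        feats = np.vstack([mfcc_features(p) for p in paths])
        models[speaker] = GaussianMixture(n_components=16).fit(feats)
    return models

def recognize_speaker(models: dict, wav_path: str) -> str:
    """Assign the utterance to the model with the highest average log-likelihood."""
    feats = mfcc_features(wav_path)
    return max(models, key=lambda s: models[s].score(feats))
```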
III. Intellectual Annotation
This framework is not only used for the demonstration of our solutions. It is in productive use for archiving historical tape-based material. This constitutes the need for additional intellectual annotation, since today's automatic annotation can provide support, but it cannot entirely substitute the intellectual work of a human. Secondly, the manually annotated metadata is used as training and test sets for the development of new algorithms. Therefore, we collect metadata for each video tape in the form of classical intellectual annotation as it is already implemented in media archives.

Scene & Topic Annotation
We developed a web-based annotation tool for the intellectual annotation of the analyzed video files. To support the professional user, the tool makes use of the detected video shots. The video is presented in slices of camera shots. The video player repeats the current shot in a loop. This makes it easier for the user to fill out all input fields without constantly dealing with the player controls. When users are finished with a shot, they can jump to the next one. The user marks the boundaries of storyline sequences as collections of multiple shots and adds a variety of bibliographical metadata like the title and type of video content, topics, and subjects in terms of individuals, locations, institutions, things and creations (like art), as well as other metadata useful either for information retrieval or as development test data.

IV. Data Aggregation
In the past, we analyzed video assets only for isolated scientific experiments. To process large quantities of videos now, the integration of the results from different analysis algorithms becomes a key challenge. For our environment of use cases, a data-warehouse solution is needed that aggregates more than only the results of the video analysis. On the one hand, it needs to incorporate the metadata supplied by its sources, like production information and data from TV broadcasters. On the other hand, it has to provide its data as an export artifact which is compatible with the formats and conventions used by archiving facilities and institutes.

A special challenge was to find a scheme which complies with the way video content producers and archives structure their data, and which includes technical data like feature vectors and audiovisual classifications. Our selected database scheme is adapted from a common standard for video documentation developed for the German public television (REM, http://rmd.dra.de/remid.php?id=REM_APR_3). We combined mandatory and optional metadata classes with the goal of maintaining a maximum of compatibility.

V. Content Delivery and Visualization
For data exchange and archiving, the digital master file, the proxy files and the metadata are exported to an LTO tape library. Search and content access for the user are provided by a web server. The user interface is used for web-based intellectual annotation, for controlling the analysis process, for information retrieval and for content browsing. The UI is able to handle multiple tenants and has a scalable interface for different display resolutions and devices. Each function runs in its own web app.

GRAPH-BASED VIDEO CLUSTERING
During the analysis and automatic annotation we extract segments of camera shots from the video stream. This shot segmentation is helpful for content browsing, but it suffers from over-segmentation. The structure is too detailed for the visualization of the actions inside a video. The user needs to be able to search for scenes or sequences as basic units.

Different approaches to clustering or grouping related shots have been published. A detailed survey of the field of video segmentation is given by [13]. A common strategy in many clustering approaches is to find structures and similarities in the given video. The similarity measurement can be based on the classification of e.g. motion vectors, dominant color, edge histogram and editing tempo. By calculating the similarity of consecutive shots, groups can be identified. "Overlapping links", introduced by [16], was one of the early strategies to find structures inside videos. It was extended by [19, 20]. The algorithm can cluster similar shots and the shots lying in between into logical story units (LSUs) [21].

Our solution was inspired by overlapping links, the concept of a scene transition graph (STG) [14, 22] and the scene detection solution published by [15]. These approaches are still the subject of current publications and optimizations like [23, 24, 25]. Shots are represented as nodes, transitions as edges. Shots with a high similarity are clustered into group nodes. This process leads to a digraph with cycles.

Data Structure
Our proposed solution is derived from the concept of shot transition graphs. We use a weighted directed graph for the representation of hierarchical sequence structures in a video. Edges represent transitions between distinct shots or sequences. Nodes represent single shots or sub-segments with a nested graph of shots inside. See Table 1.

Vs – Singular Node: a single continuous camera shot.
Es – Singular Edge: a directed edge between two Singular Nodes (Vs) representing the transition from a camera shot to its successor in the sequence of the video.
Va – Aggregated Node: a group of Singular or Aggregated Nodes forming a sequence or sub-graph.
Ea – Aggregated Edge: a directed edge between two Aggregated Nodes (Va) or between a Singular Node and an Aggregated Node. It represents a set of interrelated Singular Nodes, respectively a sub-graph containing a scene of the video.
C – Color-Similarity-Group: a list of shots, grouped by their visual similarity. The similarity is measured by a combination of the MPEG-7 descriptors Edge Histogram (EHD) and Color Layout (CLD) [10, pp. 169].
Sq – Sequence-List: a list of shots, grouped by their affiliation to a sequence, found by intellectual annotation. A sequence represents a segment of continuous action or location in a video.

Table 1: Data structures.

Es: duration of the transition; type of transition (cut, wipe, dissolve, fade), as described in the taxonomy by [7].
Vs: number of the shot; times of start, duration and end of the shot; extracted keyframes of the first and last frame; extracted keyframes from the face detection; data from the text extraction.
Ea: weight.
Va: a representative keyframe; start time of the earliest sub-element; end time of the latest sub-element; metadata of the speech recognition; annotation (topic, location, subjects, individuals etc.).

Table 2: Metadata available in the data structures.

Sequence-Graph Algorithm
In order to access the video content in a graph-based hierarchical structure, we create a directed graph to represent the video's shots and sequences. The vertices belonging to a sequence are aggregated to build a second level in the hierarchy. The metadata created during the intellectual annotation drives this aggregation.

Procedure Sequence-Graph
input: list of detected shot boundaries and transitions Sh, list of sequences Sq.
output: Sequence-Graph G1(Ea, Va)
1.  for each detected shot and transition sh_i from Sh do
2.      add new vertex Vs_i to G1
3.      add new edge Es_i to G1 connecting Vs_i and Vs_i+1
4.  end for each
5.  for each sequence sq_j from Sq do
6.      create new Va_j in G1
7.  end for each
8.  for each Vs_i from G1 do
9.      if Vs_i belongs to sequence sq_j then
10.         remove Vs_i and its out-edges and in-edges from G1
11.         add Vs_i and its edges as sub-elements to Va_j
12.     end if
13. end for each
14. for each Es_i removed from G1 do
15.     add new edge Ea_i in G1 connecting Va_j with the predecessors resp. successors of Vs_i
16. end for each
17. for each Ea_k from G1 do
18.     if more than one Ea exists with the same source vertex and the same target vertex as Ea_k then
19.         remove all duplicates and increment the weight of Ea_k
20.     end if
21. end for each

Figure 2: Procedure to create a Sequence-Graph.

The resulting Sequence-Graph (algorithm in Figure 2) represents all content sequences as aggregated nodes, together with the remaining singular nodes not belonging to a sequence, on the first level. Inside each aggregated node a sub-graph is created on the second level, representing the chain of shots forming a sequence.

Figure 3: Visualization of vertices and edges of the Sequence-Graph.
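The procedure of Figure 2 can be read as ordinary graph code. The sketch below mirrors its three steps with networkx: build the chain of singular shot nodes, collapse every intellectually annotated sequence into an aggregated node that keeps its inner chain as a sub-graph, and merge duplicate aggregated edges by incrementing their weight. The data types are simplified stand-ins for the structures of Table 1.

```python
import networkx as nx

def build_sequence_graph(shots: list, sequences: dict) -> nx.DiGraph:
    """Sketch of the Sequence-Graph procedure (Figure 2): shots form a
    chain of singular nodes; shots annotated as one sequence are collapsed
    into an aggregated node whose 'subgraph' attribute holds the inner chain."""
    g1 = nx.DiGraph()
    g1.add_nodes_from(shots, kind="singular")
    g1.add_edges_from(zip(shots, shots[1:]), weight=1)  # Es chain

    for seq_id, members in sequences.items():
        sub = g1.subgraph(members).copy()               # inner chain of the sequence
        g1.add_node(seq_id, kind="aggregated", subgraph=sub)
        for shot in members:
            # reconnect outside neighbours to the aggregated node (Ea)
            for pred in list(g1.predecessors(shot)):
                if pred not in members:
                    _add_or_weight(g1, pred, seq_id)
            for succ in list(g1.successors(shot)):
                if succ not in members:
                    _add_or_weight(g1, seq_id, succ)
            g1.remove_node(shot)
    return g1

def _add_or_weight(g: nx.DiGraph, u, v) -> None:
    """Merge duplicate aggregated edges by incrementing their weight."""
    if g.has_edge(u, v):
        g[u][v]["weight"] += 1
    else:
        g.add_edge(u, v, weight=1)

# Example: six shots, shots 2-4 were annotated as one sequence.
g = build_sequence_graph([1, 2, 3, 4, 5, 6], {"seq-A": {2, 3, 4}})
```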
Similarity-Graph Algorithm
One important feature of videos from film and television is the presence of recurring images. This happens especially when interviews or dialogs are recorded and the same individuals are shown several times. In terms of film grammar this is called the shot/reverse-shot method. See Figure 4.

Procedure Similarity-Graph
input: list of Color-Similarity-Groups C, graph G1
output: Sequence-Graph with similarity sub-graphs G2
1.  for each Va_i from G1 do
2.      create new temporary graph Gt_i
3.      for each similarity group c_j from C do
4.          if one or more sub-vertices Vs of Va_i are in c_j then
5.              add new group vertex Va_j to Gt_i
6.          end if
7.      end for each
8.      for each sub-vertex Vs of Va_i not in any c_j do
9.          add new non-group vertex Vs_j to Gt_i
10.     end for each
11.     for each sub-edge Es ∪ Ea from Va_i do
12.         add new edge Ea_j to Gt_i connecting the group vertex Va corresponding to its source and the group vertex Va corresponding to its target, respectively the non-group vertex Vs if source or target is not part of a similarity group c_j
13.     end for each
14.     calculate the strongly connected components of Gt_i
15.     for each strongly connected component scc_k from Gt_i do
16.         create new similarity vertex Va_k as a sub-element in Va_i
17.         for each shot vertex Vs_l from Va_i do
18.             if Vs_l is a member of scc_k then
19.                 remove Vs_l and its out-edges and in-edges from Va_i
20.                 add Vs_l and its edges as sub-elements to Va_k
21.             end if
22.         end for each
23.         for each Es_m removed from Va_i do
24.             add new edge Ea_m in Va_i connecting Va_k with its predecessors resp. successors
25.         end for each
26.     end for each
27.     for each edge Ea_m from Va_i do
28.         if more than one Ea exists with the same source vertex and the same target vertex as Ea_m then
29.             remove all duplicates and increment the weight of Ea_m
30.         end if
31.     end for each
32. end for each

Figure 4: Similarity-Graph procedure.

Resulting Graph Structure
The final resulting graph represents the video in a hierarchical structure. On the first level, all sequences and all standalone shots can be accessed. By selecting a sequence, all shots and similarity groups inside the selected sequence can be accessed. If a shot shows a similar image multiple times, each instance of this image is aggregated into a group. Recurring shots are recognizable by cyclic structures of the edges. On selecting a similarity group, the individual instances of the similar shots can be accessed. The results of the two clustering steps and the final three-layer graph are shown in Figure 5. Figure 6 shows the visualization of a single layer as used in the UI.

Figure 5: G1 (Sequence-Graph), G2 (Similarity-Graph and Sequence-Graph).
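The heart of the similarity-graph procedure is the strongly-connected-components step (lines 14-15 of Figure 4): projecting the shot chain onto its color-similarity groups turns recurring groups into cycles, and each strongly connected component then becomes one similarity vertex. A minimal networkx sketch of just this step, again with simplified stand-in types:

```python
import networkx as nx

def similarity_subgraphs(shot_chain: list, groups: dict) -> list:
    """Sketch of the SCC step of the Similarity-Graph procedure: project
    the shot chain onto its color-similarity groups (recurring groups
    create cycles) and return each strongly connected component as one
    similarity group of shots. 'groups' maps a shot to its group id."""
    gt = nx.DiGraph()
    label = lambda shot: groups.get(shot, shot)  # group vertex or plain shot vertex
    for a, b in zip(shot_chain, shot_chain[1:]):
        gt.add_edge(label(a), label(b))
    components = []
    for scc in nx.strongly_connected_components(gt):
        shots = {s for s in shot_chain if label(s) in scc}
        if len(scc) > 1 or len(shots) > 1:       # keep only real recurrences
            components.append(shots)
    return components

# Shot/reverse-shot example: shots 1 and 3 show speaker A, shots 2 and 4 speaker B.
print(similarity_subgraphs([1, 2, 3, 4, 5], {1: "A", 3: "A", 2: "B", 4: "B"}))
# -> [{1, 2, 3, 4}]  the shots of the dialog collapse into one similarity vertex
```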
GRAPH-BASED USER INTERFACE
UI approaches with the purpose of addressing structures in video content have been developed mainly in the fields of film studies and human-computer interaction (HCI). They normally focus on certain key aspects like the analysis, description [4] or summarization [1] of content. From their perspective, the temporal order of a video's single sequences is an important piece of information and therefore one of the fundamental principles of their modus operandi.

By shifting the main focus to the video's structure, we managed to design a user interface that makes it possible to quickly overlook a whole file without losing any detail. A screenshot of the graph-based UI is depicted in Figure 7.

In order to avoid the issues reported by our user groups, as described above, we decided to organize all available information in a way that emphasizes the video's structure. The richness of detail increases from top (overview) to bottom (all details and metadata). The presented metadata types are summarized in Table 2. The following interface description is connected to the layers presented in Figures 6 and 7.

I. Video player – the player can be used to examine the single segments in any intended way. In order to provide permanent availability, it remains at the top of the screen when scrolling to the lower parts of the UI.
II. Current graph – its nodes represent either a single shot group or a cluster of related groups. By using a simple directed graph for the top level, we were able to display all nodes in a familiar left-to-right order. Every node contains a representative image sample and some basic information on its content. The existence of child graphs is color-coded (blue) on this level of detail.
III. Collapsible container – used to display a more granular child graph belonging to a certain top-level node.
IV. Queue – nodes can be transferred into a drag-and-drop operated queue of cards that offer a more detailed view of their content. Furthermore, they can be used to manage a collection of shots or shot groups that can be watched directly or exported for further use, e.g. in editing software.
V. Details view – shows all data that is available for one of the cards. It consists of several lines displaying key frames, detected faces, off-text (voice-over speech) and text overlays.

Figure 6: Multilayer view of a graph-based UI.
Figure 7: Schematic view of the UI.

EVALUATION
We performed a first evaluation of our approach by using a combination of baseline tests and questionnaires. For this, we designed a set of tasks comparable to those described by our group of experts. The content used for the evaluation consists of real television news programs, produced during the early to mid-1990s and archived on VHS video tapes. The actual test set was composed by randomly selecting 1377 minutes of this video material.

Four expert users were asked to perform search tasks. They were given short descriptions of 27 randomly picked video sequences with durations between 5 seconds and 10 minutes. The task was to find the described sequences in the corresponding video file and to write down the time codes of the sequence boundaries. Search tasks like these are quite comparable to the real-life work of video editors, because video content in tape-based archives is only marginally documented. Manual content browsing in a video player or non-linear editing software (NLE) is used to find sequences of video content reusable in new video clips.

For comparison, the search tasks were performed using our graph-based user interface, VLC Media Player and Adobe Premiere Pro (CS6). For each task the time needed for completion was recorded. Overall, 108 different search operations were performed. Furthermore, differences in the accuracy of the time codes were taken into account. With the graph-based UI, the average duration per search task was 93 seconds. When searching with VLC (average: 122 s) and Premiere Pro (average: 179 s), significantly more time was needed (Figure 8). As a result, our graph-based solution outperformed VLC and Premiere Pro: searching took 27.8% more time in VLC and 48.5% more time in Premiere Pro. One reason for the weak performance of Premiere Pro could be its zoom function. It was heavily used by the testers, but led to longer search times.

Figure 8: Evaluation results (average search time per task).
Graph-based UI: 93 s
VLC Media Player: 122 s
Premiere Pro CS6: 179 s

One disadvantage of the graph-based UI turned out to be the fact that entities and events inside a video shot cannot be isolated. They are bound to the boundaries of the surrounding shot and cannot be exported independently. In terms of perceiving the actual structure of the video, all users reported gaining a deeper understanding when using our approach than when using VLC or Premiere Pro.

FUTURE WORK
The next step for the analysis and graph-based clustering will be the substitution of the manual annotation of video sequences by an automatic sequence segmentation algorithm. Surveys of the state of the art in video segmentation indicate that a multimodal fusion of the analysis results can be used to cluster successive shots into video sequences. Most approaches use visual similarity features. But as discussed in [7], concepts and rules from the production of video content can also be useful to find sequences or scenes inside video content.

The graph-based user interface will be evaluated in additional user tests, exploring whether its use is beneficial for non-professional users as well. A second study will evaluate which text-based metadata should be presented at the different elements to meet the needs of the users. Currently, extensions of the UI are under development to enable a sync function. It will allow adapting the presented graph elements when the current position in the video shifts to the next sequence. This will give the UI a two-sided interaction between the video player and the graph structure.

CONCLUSION
In this paper we presented our concept of a hierarchical presentation of video items in a graph-based structure. We described our framework, which incorporates video and audio analysis, intellectual annotation and graph analysis to construct a multi-layer structure for content consumption. Our web-based UI shows how classical sequential content browsing in videos can be extended to incorporate the inner structures and relations of the video's sub-elements.

ACKNOWLEDGMENTS
Parts of this work were accomplished in the research project validAX, funded by the German Federal Ministry of Education and Research.
REFERENCES
[1] Adami, N., Benini, S., Leonardi, R. An overview of video shot clustering and summarization techniques for mobile applications. In Proc. MobiMedia '06, ACM (2006), No. 27.
[3] Knauf, R., Kürsten, J., Kurze, A., Ritter, M., Berger, A., Heinich, S., Eibl, M. Produce. annotate. archive. repurpose --: accelerating the composition and metadata accumulation of TV content. In Proc. AIEMPro '11, ACM (2011), 30-36.
[4] Korte, H. Einführung in die systematische Filmanalyse. Schmidt (1999), 40.
[6] Lu, L. and Zhang, H.-J. Speaker change detection and tracking in real-time news broadcasting analysis. In Proc. MULTIMEDIA '02, ACM (2002), 602-610.
[7] Rickert, M. and Eibl, M. A proposal for a taxonomy of semantic editing devices to support semantic classification. In Proc. RACS 2014, ACM (2014), 34-39.
[8] Rickert, M. and Eibl, M. Evaluation of media analysis and information retrieval solutions for audio-visual content through their integration in realistic workflows of the broadcast industry. In Proc. RACS 2013, ACM Press (2013), 118-121.
[9] Ritter, M. and Eibl, M. An extensible tool for the annotation of videos using segmentation and tracking. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2011), 295-304.
[10] Ritter, M. Optimierung von Algorithmen zur Videoanalyse. Chemnitz (2013), 1-336, esp. 119-144, 187-213. ISBN 978-3-944640-09-9.
[11] Heinich, S. Textdetektion und -extraktion mit gewichteter DCT und mehrwertiger Bildzerlegung. In Proc. WAM 2009, TU Chemnitz (2009), 151-162. ISBN 978-3-000278-58-7.
[12] Manthey, R., Herms, R., Ritter, M., Storz, M., Eibl, M. A support framework for automated video and multimedia workflows for production and archive. In Proc. HCI International 2013, Springer (2013), 336-341.
[13] Del Fabro, M., Böszörmenyi, L. State-of-the-art and future challenges in video scene detection: a survey. Multimedia Systems 19, 5, Springer (2013), 427-454.
[14] Yeung, M.M. and Yeo, B.-L. Time-constrained clustering for segmentation of video into story units. In Proc. 13th International Conference on Pattern Recognition, IEEE (1996), 375-380 vol. 3. doi:10.1109/ICPR.1996.546973.
[15] Ngo, C.-W., Ma, Y.-F., Zhang, H.-J. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2, IEEE (2005), 296-305.
[16] Hanjalic, A., Lagendijk, R.L., Biemond, J. Automated high-level movie segmentation for advanced video-retrieval systems. IEEE Transactions on Circuits and Systems for Video Technology 9, 4, IEEE (1999), 580-588.
[17] Berger, A., Kürsten, J., Eibl, M. Visual string of reformulation. In Proc. HCI International, Springer (2009), LNCS 5618.
[18] Berger, A. Design thinking for search user interface design. In Proc. EuroHCIR 2011, Newcastle (2011), 38-41.
[19] Kwon, Y.-M., Song, C.-J. and Kim, I.-J. A new approach for high level video structuring. In Proc. IEEE International Conference on Multimedia and Expo, ICME 2000 (2000), 773-776.
[20] Wang, W. and Gao, W. Automatic segmentation of news items based on video and audio features. In Proc. Advances in Multimedia Information Processing, PCM 2001, LNCS 2195 (2001), 498-505.
[21] Vendrig, J. and Worring, M. Systematic evaluation of logical story unit segmentation. IEEE Transactions on Multimedia 4, 4, IEEE (2002), 492-499.
[22] Shi, J. and Malik, J. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 8, IEEE (2000), 888-905.
[23] Sidiropoulos, P., Mezaris, V., Kompatsiaris, I., Meinedo, H. and Trancoso, I. Multi-modal scene segmentation using scene transition graphs. In Proc. MM '09, 17th ACM International Conference on Multimedia, ACM (2009), 665-668.
[24] Xu, S., Feng, B., Ding, P. and Xu, B. Graph-based multi-modal scene detection for movie and teleplay. In Proc. Acoustics, Speech and Signal Processing (ICASSP) 2012, IEEE (2012), 1413-1416.
[25] Porteous, J., Benini, S., Canini, L., Charles, F., Cavazza, M. and Leonardi, R. Interactive storytelling via video content recombination. In Proc. MM '10, 18th ACM International Conference on Multimedia, ACM (2010), 1715-1718.