Automation in Video Editing: Assisted Workflows in Video Editing

THAN HTUT SOE, University of Bergen, Norway

Capturing, publishing and distributing video content have become accessible and efficient. However, the task of video editing remains very time-consuming. Because video is a time-based and dual-track (audio and video) medium, each video clip must be inspected and editing decisions have to be made on individual frames. Introducing automation into video editing has been attempted since the beginning of digital video, with little or no success so far. We present a breakdown of the tasks involved in video editing and argue for an approach that introduces automation into these smaller tasks instead of into the entire editing workflow. By working at the level of tasks, the impact of introducing automation can be measured and the user experience evaluated. In addition, we lay out the challenges in our approach to introducing automation into video editing and present some AI techniques that can be applied to video editing workflows, as well as AI concerns relevant to the topic.

Additional Key Words and Phrases: video, video editing, automation, assisted workflow

1 INTRODUCTION

Video is the most popular form of content on the Internet measured in traffic. According to the Cisco Visual Networking Index [6], 75% of Internet traffic in 2017 was video content. With mobile phones, video sharing platforms and social media, it is easier than ever to capture and publish videos. However, editing video is very time-consuming. Video is a tedious medium to work with, as it requires inspecting and manipulating videos at the level of individual frames, and it is a dual-track medium with both audio and video. There have been various attempts to automate video editing and to create easier video editing workflows with semi-automation. In this position paper, we focus on the latter and lay out an overview of, and the challenges in, introducing automation into video editing workflows.
In the whole video production workflow, video editing is part of the post-production process [13], which takes place after the media assets have been created via filming or acquisition from other sources. Video editing is defined as the process of assembling shots and scenes into a final product, making decisions about their length and ordering [15]. Nonlinear editing is defined as editing that does not require that the sequence be worked on sequentially [15]. All modern digital video editing software is non-linear: the original video, audio or images are not modified, but a new edit is specified based on cuts and modifications of the existing media assets. Video editing software such as Adobe Premiere or Final Cut Pro has decades of development behind it. Being mature software, the user interfaces and interactions of video editing tools are very similar across many different video editing programs. However, these established interfaces and interactions were created and evolved to edit videos without automation. The edit made by a non-linear editor consists of an ordered list of the media assets used in the edit and their time codes, and it is usually stored in a file format called an edit decision list (EDL).

Entirely automated video editing has received a lot of research interest. At present, automated video production is aimed at creating video summaries or mashups. Video summaries are highlights from video clips which fulfill some selection criteria such as importance, being aesthetically pleasing, or having some level of interest [5, 18].

Workshop proceedings: Automation Experience at the Workplace. In conjunction with CHI'21, May 7th, 2021, Yokohama, Japan. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Website: http://everyday-automation.tech-experience.at
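The edit decision list described above, an ordered list of source assets with in and out timecodes, can be sketched as a minimal data structure. This is an illustrative model, not the CMX-style EDL file format used by professional editors; the names and fields are our own.

```python
from dataclasses import dataclass

# Hypothetical, minimal model of an edit decision list (EDL):
# the edit only references source media by timecode, so the
# original assets stay unmodified (non-linear editing).

@dataclass
class EdlEvent:
    source: str        # media asset the segment comes from
    src_in: float      # in-point in the source, in seconds
    src_out: float     # out-point in the source, in seconds

def edl_duration(events: list[EdlEvent]) -> float:
    """Total duration of the edit described by the EDL."""
    return sum(e.src_out - e.src_in for e in events)

edl = [
    EdlEvent("interview_cam_a.mp4", 12.0, 45.5),
    EdlEvent("broll_city.mp4", 3.0, 10.0),
    EdlEvent("interview_cam_a.mp4", 80.0, 95.0),
]
print(edl_duration(edl))  # 55.5
```

Note that the same source file can appear several times, and reordering or trimming events changes the edit without touching the media.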
Mashups are videos produced by concatenating video segments from clips that are usually recorded at the same event by different cameras [14, 20]. These completely automated video editing methods, as used to create video summaries and mashups, are intended to create a simple highlight compilation of the videos, and there is no user interaction involved. It is clear that although fully automated video editing is useful in some cases, its usage in the workplace is very limited, as these are simply highlight generators that are not configurable.

Intelligent video editing tools have been attempted since the beginning of digital video editing, with the goal of making video editing easier. These tools make video editing easier by offering semi-automation or by allowing manipulation of videos at a higher level of abstraction than individual frames (e.g. spoken words, shots and dialogue). One early example of such a tool is Silver [4] from 2002, which provides smart selection of video clips, as well as abstract views for video editing, by using metadata from the videos. A more recent example of an intelligent video editing tool is Roughcut [11]. Roughcut enables computational editing of dialogue-driven scenes using user input of the dialogue for the scene, raw recordings, and editing idioms. There are also an open-source tool, autoEdit [16], and a research prototype [2] that enable text-based editing of video interviews by linking text transcripts to the videos.

2 ASSISTED WORKFLOWS IN VIDEO EDITING

Typical tasks involved in video editing are described in Figure 1. This set of tasks has not been verified against industry practices but was created in consultation with a single video editing product manager. Its main purpose is to introduce the tasks involved in video editing. The presence, emphasis and order of each of the tasks will vary depending on the video editing context.
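The idea behind text-based editing of the kind autoEdit [16] and [2] offer can be illustrated with a small sketch: given a transcript whose words carry timestamps in the source video (as produced by ASR or forced alignment), deleting words from the text yields the segments of video to keep. The data and function below are hypothetical, not the actual autoEdit API.

```python
# Sketch of transcript-linked editing: each word is
# (text, start_seconds, end_seconds) in the source video.
# Removing a word from the text removes its span from the edit.

transcript = [
    ("welcome", 0.0, 0.4), ("to", 0.4, 0.5), ("the", 0.5, 0.6),
    ("um", 0.6, 1.1), ("workshop", 1.1, 1.7),
]

def cuts_for_kept_words(words, removed):
    """Merge consecutive kept words into (start, end) segments to keep."""
    segments = []
    for word, start, end in words:
        if word in removed:
            continue
        if segments and abs(segments[-1][1] - start) < 1e-6:
            # word starts where the current segment ends: extend it
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments

print(cuts_for_kept_words(transcript, {"um"}))
# [(0.0, 0.6), (1.1, 1.7)]
```

The resulting segments are exactly the kind of cut list a non-linear editor consumes, which is what makes editing "in the text" possible.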
The tasks required for video editing can depend both on the type of video being edited and on the organizational context in which the video is being produced.

Fig. 1. Tasks involved in video editing

Semi-automated video editing tools that address the whole video editing process usually target only a particular type of video editing need, such as an interview [2], an instructional video, or a dialogue scene [11]. This is because the entire video editing workflow is very context-dependent and complex. In addition, video workflows might be personalized as well. Therefore, it might be more suitable to introduce automation for each individual task, or a few combinations of tasks, at first. We can then provide smaller automation blocks that editors can personalize to fit their needs. Another argument for introducing automation at the task level is that each individual task can be measured, and the changes to user experience introduced by automation in each task can be studied. On top of that, automating the whole video editing process is not suitable for current approaches in machine learning-based automation, as it is difficult to create a dataset capturing the correct way of editing a whole video.

We have studied semi-automation in assisted subtitling in our work accepted for publication at Interactive Media Experiences 2021 [17]. That paper explores how the addition of semi-automation, or having to work with automation, in subtitling changes the performance, behavior, and experiences of novice users. In the paper, an assisted subtitling prototype based on speech-to-text allows novice users to create slightly more accurate subtitles much more efficiently. However, the users rated the experience with assisted subtitling as more difficult than starting from scratch.
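The core automation step in assisted subtitling, turning timed speech-to-text output into draft subtitles, can be sketched as below. The utterance data is invented for illustration; in the actual prototype the user would still correct both text and timing, which is where the interaction design questions discussed in this paper arise.

```python
# Sketch: converting timed ASR utterances into SubRip (SRT)
# subtitle blocks. Utterances are (start_sec, end_sec, text).

def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(utterances) -> str:
    """One numbered SRT block per utterance."""
    blocks = []
    for i, (start, end, text) in enumerate(utterances, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks)

print(to_srt([(0.0, 2.5, "Welcome to the workshop."),
              (2.5, 5.0, "Let's talk about video editing.")]))
```

Everything after this mechanical conversion, segmenting utterances into readable lines, fixing recognition errors, adjusting timing, is where the human effort and the user experience problems reported in [17] concentrate.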
The users' experiences with the introduction of automation into the subtitling workflow are summarized in the paper, and it lays out the usability problems that need to be addressed for efficient human-machine collaboration in subtitling. In addition, possible collaboration between users and state-of-the-art machine learning-based speech-to-text systems in subtitling is laid out.

3 CHALLENGES IN INTRODUCING AUTOMATION INTO VIDEO WORKFLOWS

The efficiency of introducing automation or semi-automation in the workplace depends on creating efficient human-machine communication, and on well-designed user interfaces that enable that communication. Introducing automation into a product is itself a challenge, and some research highlights the challenges and general guidelines involved in building AI-infused products [1]. The eighteen guidelines provided cover building the user experience, clarifying user expectations, matching social norms, and learning from users' behaviour [1]. However, applying these guidelines to any specific scenario requires reevaluating them in that specific context.

AI techniques have been developed to manipulate or create video. The earliest such work, Video Rewrite [3], uses existing footage of a person to automatically create a video of that person speaking to a different audio track. This work was done with the intention of facilitating movie dubbing. AI synthesis of video from existing footage became popular once again after deep neural networks were trained to synthesize fake videos, a process known as creating deepfakes. According to [12], a deepfake is content generated by AI that is authentic in the eyes of a human observer. Deepfakes raise serious concerns, but they also have potential applications in generating or adapting video content. How AI techniques can be used to extract information from videos is a very diverse area of research.
We are particularly interested in facial recognition, object detection, object tracking, scene detection, sentiment analysis, video reasoning and video captioning. Facial recognition refers to the problem of identifying whether a human face is present in an image, and possibly whose, while object detection is the problem of identifying a specific object in an image. Object tracking is the problem of identifying and locating a specific object and tracking its movement across the frames of a video. Scene detection, or video segmentation, is identifying segments of a video which are semantically or visually related. Sentiment analysis is the problem of identifying the sentiment that would be conveyed by a given piece of content: is it happy, sad, ironic, etc. Video captioning [20] is an AI technique that generates natural-language descriptions capturing the dynamics of a video.

Video editing workflows are complicated processes and depend on the context of the video production. In this position paper we have tried to lay out some of the challenges in video editing, which is part of the post-production process of video workflows. Based on our evaluation of an assisted subtitling workflow, there are changes in performance characteristics as well as in user experience when automation is introduced, in comparison with existing workflows. Similarly, the challenges of automation in video workflows can be summarized as:

• Understanding existing workflows
• Deciding which parts of the workflows to automate, and to what extent
• Working with the automation/AI technology
• Understanding the impact of automation by evaluating the tool

Upon solving these challenges, we can use the knowledge gained to create better tools with automation for the users. In addition, automation methods that can learn from or adjust to user feedback can be explored. Simply plugging ML into a video editing tool does not help users understand and utilize what ML can and cannot do.
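As a concrete illustration of one of the extraction techniques listed above, a classic (pre-deep-learning) form of scene detection can be sketched as thresholding the difference between per-frame histograms. The frames and threshold here are toy values; real systems operate on decoded video frames with tuned or learned thresholds.

```python
# Illustrative scene-boundary detection: flag a cut wherever the
# grayscale histogram of a frame differs sharply from the
# previous frame's histogram.

def hist(frame, bins=4, max_val=256):
    """Histogram of a frame given as a flat list of pixel values."""
    h = [0] * bins
    for px in frame:
        h[min(px * bins // max_val, bins - 1)] += 1
    return h

def scene_cuts(frames, threshold):
    """Frame indices where the L1 histogram distance to the
    previous frame exceeds the threshold."""
    cuts = []
    prev = hist(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        cur = hist(frame)
        if sum(abs(a - b) for a, b in zip(prev, cur)) > threshold:
            cuts.append(i)
        prev = cur
    return cuts

dark = [10, 20, 30, 15]          # all pixels in the lowest bin
bright = [240, 250, 230, 245]    # all pixels in the highest bin
frames = [dark, dark, bright, bright]
print(scene_cuts(frames, threshold=4))  # [2]
```

Even a crude detector like this shows why scene detection is a natural automation block: it segments raw footage into units an editor can reason about, without making any creative decision itself.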
Study of human factors and UI design for ML is necessary to explore how users understand ML and the affordances provided by introducing ML into the process. Dove et al. [8] stated in their study that ML is both an underexplored opportunity for HCI researchers and a design material with unknown potential. The authors also pointed out several challenges: the lack of understanding of ML in the UX community, the data-dependent nature of ML black boxes, and the difficulty of making interactive prototypes with ML. There are emerging fields of study in AI that aim to put humans at the center of control. An example from creative work is writing with machines in the loop [7]. Clark et al. [7] performed an experiment with two machine-in-the-loop systems for story writing and slogan writing tasks; the participants enjoyed collaborating with the machine, even though third-party evaluations rated stories written with machine-generated suggestions as not as good as stories written by humans alone. Visual storytelling models generate descriptions of a series of pictures that describe an event. Hsu et al. [10] analyzed how humans edit such machine-generated text. Explainable Artificial Intelligence (XAI) is an emerging field in machine learning that develops techniques that are more explainable to human users [9].

4 CONCLUSION

When introducing automation into video editing work, there are three important factors to consider: understanding existing video editing tasks and workflows, deciding where and how to introduce automation, and finally the user experience of automated workflows. We have provided a sample video editing workflow; however, it has to be elaborated and verified in a study with professionals in the industry. Deciding where and how to introduce automation could benefit from studying the needs of, and expectations for, automation in the industry.
The user experience of automated workflows depends greatly on the interactions and interfaces between human and automation. The results from our evaluation of an automated subtitling workflow suggest that just adding automation onto existing interfaces is not sufficient. Since the introduction of automation changes the way the tool is used, new interactions have to be crafted, informed by users' experiences and needs. We argue for task-based automation in video editing workflows because, in the context of a creative and subjective activity like video editing, it is better to automate away the tedious and time-consuming tasks and leave the creative work, the telling of the story, to human editors. However, there is a lot of potential in other types of automation in video editing, such as writing a video with text [19] or computational highlight generation from video archives. Since automation and AI technology for video are rapidly evolving, there should be more attempts to introduce automation not only into video editing but into the entire video production workflow. The success of introducing automation, in our opinion, depends on crafting a good user experience and adaptable automation.

REFERENCES

[1] Saleema Amershi, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, Eric Horvitz, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, and Paul N. Bennett. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems - CHI '19. ACM Press, Glasgow, Scotland, UK, 1–13. https://doi.org/10.1145/3290605.3300233
[2] Floraine Berthouzoz. [n.d.]. Tools for Placing Cuts and Transitions in Interview Video. ([n. d.]), 8.
[3] Christoph Bregler, Michele Covell, and Malcolm Slaney. 1997. Video Rewrite: driving visual speech with audio. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques - SIGGRAPH '97.
ACM Press, 353–360. https://doi.org/10.1145/258734.258880
[4] Juan Casares, A. Chris Long, Brad A. Myers, Rishi Bhatnagar, Scott M. Stevens, Laura Dabbish, Dan Yocum, and Albert Corbett. [n.d.]. Simplifying Video Editing Using Metadata. ([n. d.]), 10.
[5] Chong-Wah Ngo, Yu-Fei Ma, and Hong-Jiang Zhang. 2005. Video summarization and scene detection by graph modeling. IEEE Transactions on Circuits and Systems for Video Technology 15, 2 (Feb. 2005), 296–305. https://doi.org/10.1109/TCSVT.2004.841694
[6] V. Cisco. 2018. Cisco visual networking index: Forecast and trends, 2017–2022. White Paper 1 (2018), 1.
[7] Elizabeth Clark, Anne Spencer Ross, Chenhao Tan, Yangfeng Ji, and Noah A. Smith. 2018. Creative Writing with a Machine in the Loop: Case Studies on Slogans and Stories. In Proceedings of the 23rd International Conference on Intelligent User Interfaces - IUI '18. ACM Press, Tokyo, Japan, 329–340. https://doi.org/10.1145/3172944.3172983
[8] Graham Dove, Kim Halskov, Jodi Forlizzi, and John Zimmerman. 2017. UX Design Innovation: Challenges for Working with Machine Learning as a Design Material. ACM Press, 278–288. https://doi.org/10.1145/3025453.3025739
[9] David Gunning. 2017. Explainable Artificial Intelligence (XAI). (Nov. 2017), 38.
[10] Ting-Yao Hsu, Yen-Chia Hsu, and Ting-Hao 'Kenneth' Huang. 2019. On How Users Edit Computer-Generated Visual Stories. arXiv:1902.08327 [cs] (Feb. 2019). http://arxiv.org/abs/1902.08327
[11] Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. 2017. Computational video editing for dialogue-driven scenes. ACM Transactions on Graphics 36, 4 (July 2017), 1–14. https://doi.org/10.1145/3072959.3073653
[12] Yisroel Mirsky and Wenke Lee. 2021. The Creation and Detection of Deepfakes: A Survey. Comput. Surveys 54, 1 (Jan. 2021), 1–41. https://doi.org/10.1145/3425780
[13] Frank Nack. 2005. Capture and transfer of metadata during video production.
In Proceedings of the ACM workshop on Multimedia for human communication from capture to convey - MHC '05. ACM Press, Hilton, Singapore, 17. https://doi.org/10.1145/1099376.1099382
[14] Duong Trung Dung Nguyen, Axel Carlier, Wei Tsang Ooi, and Vincent Charvillat. 2014. Jiku director 2.0: a mobile video mashup system with zoom and pan using motion maps. In Proceedings of the 22nd ACM international conference on Multimedia. ACM, Orlando, Florida, USA, 765–766. https://doi.org/10.1145/2647868.2654884
[15] Jeffrey A. Okun, Susan Zwerman, Kevin Rafferty, and Scott Squires (Eds.). 2015. The VES handbook of visual effects: industry standard VFX practices and procedures. Focal Press, Taylor & Francis Group, New York.
[16] Pietro Passarelli. 2019. autoEdit: Fast Text Based Video Editing. http://www.autoedit.io/
[17] Than Htut Soe, Frode Guribye, and Marija Slavkovik. 2021. Evaluating AI Assisted Subtitling. Accepted for publication at the ACM International Conference on Interactive Media Experiences (IMX 2021).
[18] C. M. Taskiran, Z. Pizlo, A. Amir, D. Ponceleon, and E. J. Delp. 2006. Automated video program summarization using speech transcripts. IEEE Transactions on Multimedia 8, 4 (Aug. 2006), 775–791. https://doi.org/10.1109/TMM.2006.876282
[19] Miao Wang, Guo-Wei Yang, Shi-Min Hu, Shing-Tung Yau, and Ariel Shamir. 2019. Write-a-video: computational video montage from themed text. ACM Transactions on Graphics 38, 6 (Nov. 2019), 1–13. https://doi.org/10.1145/3355089.3356520
[20] Zuxuan Wu, Ting Yao, Yanwei Fu, and Yu-Gang Jiang. 2016. Deep Learning for Video Classification and Captioning. arXiv preprint arXiv:1609.06782 (2016).