A Joint Attention Model for Automated Editing

Hui-Yin Wu and Arnav Jhala
huiyin_wu@ncsu.edu, ahjhala@ncsu.edu
North Carolina State University

Abstract

We introduce a model of joint attention for the task of automatically editing video recordings of corporate meetings. In a multi-camera setting, we extract pose data of participants from each frame and audio amplitude from individual headsets. These are used as features to train a system to predict the importance of each camera. Editing decisions and rhythm are learned from a corpus of human-expert edited videos. A Long Short-Term Memory (LSTM) neural network is trained on video and audio features from the expert-edited videos to predict joint attention, and editing predictions are made on other test data. The output of the system is an editing plan for the meeting in Edit Description Language (EDL) format.

Copyright © by H.-Y. Wu, A. Jhala. Copying permitted for private and academic purposes. In: H. Wu, M. Si, A. Jhala (eds.): Proceedings of the Joint Workshop on Intelligent Narrative Technologies and Workshop on Intelligent Cinematography and Editing, Edmonton, Canada, 11-2018, published at http://ceur-ws.org

1 Introduction

Joint attention is an element of human communication in which the attention of others is drawn through non-verbal processes such as gaze and gesture [Hea05]. Editors use continuity and analytical editing to imitate the audience's "natural attention", using gaze to draw the audience's attention to elements of setting and story [Bor85, Smi12]. The explosive growth of everyday video recordings calls for smart methods that can understand context, and automatically process and present data in a meaningful way. While intelligent camera-switching technology is available to some extent, it is based primarily on audio, movement, and other low-level features of video streams. Existing work shows that LSTMs have been effective for video summarization tasks [ZCSG16] due to their ability to model long-range dependencies across shots. However, the use of LSTMs for a more complex task such as video editing remains insufficiently addressed.

We present a joint attention-based editing model in which we extract audio and pose data to train an LSTM that ranks each focal point in the room at each second, and we produce an automated edit of the videos. To avoid bias in the choice of material, the videos we use are meeting recordings from the AMI corpus established by the University of Edinburgh [MCK+05], where 100+ hours of meetings were recorded in smart meeting rooms equipped with multiple cameras, individual headsets, and other annotations. We address three main challenges:

1. detecting head pose and audio amplitude from each camera
2. training an LSTM model to rank the joint attention of focal points in the room based on expert edits
3. generating an edit of the meeting using the scores output by the LSTM

The extraction of the head pose and audio data and the training of the model are carried out as pre-processing steps, while the final editing can be done in real time. We envision that the automated editing techniques built in this work could be broadly applicable to intelligent camera systems for both productivity and entertainment.

The rest of the paper first covers related work in the field. After providing an overview of our approach, we introduce the calculation of head pose and the technical details of the LSTM. Finally, we discuss limitations and future work.
2 Related Work

In HCI, sound and motion have been used as cues for automated camera capture systems that edit meeting or lecture videos [BLA12, RBB08, LRGC01]. Arev et al. [APS+14] were the first to propose using joint attention in social cameras (i.e., cameras carried by people during a group activity). Editing tools that take cinematographic rules into consideration are also emerging. Ozeki et al. [ONO04] created a system that generates attention-based edits of cooking shows by detecting speech cues and gestures. Notably, Leake et al. [LDTA17] focus on dialogue scenes and provide a fast approach that evaluates film idioms and selects a camera for each line of dialogue.

Video abstraction or summarization techniques generally focus on the problem of selecting keyframes or identifying scene sections for a compact representation of a video. Ma et al. [MLZL02] were the first to propose an attention model for video summarization combining audio, movement, and textual information. Lee et al. [LG15] generated storyboards of daily tasks recorded on a wearable camera based on gaze, hand proximity, and frequency of appearance of an object. LSTMs have been used for video summarization tasks [ZCSG16] due to their ability to model long-range dependencies beyond a single frame.

Virtual cinematography has been a prominent area of study in graphics, where the challenge is often how to place cameras for story and navigation. Jhala and Young designed the Darshak system, which generates a sequence of shots that fulfills goals of showing specific story actions and events [JY11]. Common editing rules [GRLC15] and film editing patterns [WC15] have been applied to virtual camera systems to ensure spatial and temporal continuity and to define common film idioms. Our work differs from virtual cinematography in two main ways: (1) in a 3D environment, object locations and camera parameters are precise, whereas in our video data, camera parameters and the positions of people and objects can only be estimated, and (2) a virtual camera can be moved anywhere, while the cameras in the meeting recordings are fixed.

3 Joint Attention Model

Figure 1: (a) Configuration of the meeting room and cameras. The presentation and whiteboard are situated at the front of the room, visible from the center camera. (b) Participant A's focus (indicated by arrow color) based on A's head pose and the configuration of the meeting room. If two or more targets are close, there can be ambiguity as to what A is looking at.

We present the joint attention model from which we obtain features to train an LSTM. Meetings are selected from the videos in the AMI Corpus [MCK+05]: IDs IS1000a, IS1008d, and IS1009d, representing three types of interactions (brainstorming, status update, and project planning). Each meeting is around 25 minutes long. These meetings were held with 4 people in a smart meeting room equipped with 7 cameras and individual microphone headsets. The configuration of the room is shown in Figure 1a. The 7 cameras include 1 overview camera, 2 side cameras, and 4 closeups, one on each participant.

3.1 Determining Focal Points

To extract the head pose of people in the video, we use the OpenPose library's [CSWS17] MPI 15-keypoint detection, which has five points on the head: 2 ears, 2 eyes, and the nose tip. Pose is calculated per frame, at 25 fps. For each point, the on-screen (x, y) coordinates are given, along with a confidence score between 0 and 1.
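As an illustration of this step, the sketch below shows one way to read OpenPose's per-frame JSON output and collect the five head keypoints with their confidence scores. It is not part of the original pipeline; the keypoint indices and the JSON field names are assumptions based on typical OpenPose output and may need to be adapted to the exact model and version used.

```python
import json

# Assumed indices of the five head keypoints in the keypoint array;
# these depend on the OpenPose model and are placeholders.
HEAD_POINTS = {"nose": 0, "r_eye": 14, "l_eye": 15, "r_ear": 16, "l_ear": 17}

def head_keypoints(frame_json_path):
    """Return, for each detected person in one OpenPose frame JSON file,
    a dict mapping head-point name to (x, y, confidence)."""
    with open(frame_json_path) as f:
        frame = json.load(f)
    people = []
    for person in frame.get("people", []):
        # Keypoints are stored as a flat list of [x, y, c] triples.
        kp = person.get("pose_keypoints_2d") or person.get("pose_keypoints")
        points = {}
        for name, idx in HEAD_POINTS.items():
            x, y, c = kp[3 * idx], kp[3 * idx + 1], kp[3 * idx + 2]
            points[name] = (x, y, c)  # c is the detection confidence in [0, 1]
        people.append(points)
    return people
```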
This information serves as input for each camera, which then calculates the confidence level of a head orientation based on the visibility of the five facial points: the two eyes, the two ears, and the nose. The eight possible head orientations are (F)ront (i.e., looking in the same direction as the camera), (B)ehind (i.e., looking at the camera), (L)eft and (R)ight from the camera's perspective (with only the nose tip and one eye and ear visible), and variations of these: LF (left-front), RF (right-front), LB (left-behind), and RB (right-behind). The confidence level for a head orientation is the sum of the confidence scores c of the n points p_v that should be visible, plus the sum of 1 - c for the m points p_h that should be hidden:

PoseConfidence = \sum_{i=1}^{n} c_{p_{v_i}} + \sum_{j=1}^{m} (1 - c_{p_{h_j}})    (1)

The pose is then mapped to the focal points corresponding to the head orientation, as in the example shown in Figure 1b for participant A. Each pose can refer to one, none, or more than one focal point. The focal point matrix (Table 1) shows the mapping from each head orientation-camera pair to the corresponding focal point(s).

Table 1: The focal point matrix indicates, for the L, R, and closeup (CU) cameras, what each camera shows and which focal point a participant's head orientation corresponds to. The focal points in the room are the presentation (P), the whiteboard (W), and participants A, B, C, and D.

Camera  Shows  L    R    LF  RF  F   LB  RB   B
L       AC     PA   C    -   -   -   PW  D    BD
R       BD     D    PWB  -   -   -   C   A    AC
CU.A    A      P    C    -   -   -   PW  D    B
CU.B    B      D    PW   -   -   -   C   A    C
CU.C    C      PA   -    -   -   -   PW  D    B
CU.D    D      C    PWB  -   -   -   C   PWB  A

The confidence level that the focal point is a specific target t (a person or object) for a camera C is the average of the confidence levels of the x head orientations o that have the target as a focal point, summed over each person p detected by the camera:

FocalPointConfidence(C, t) = \sum_{p=1}^{n} \frac{\sum_{o=1}^{x} c_{o,p}(t)}{x}    (2)

This gives a focal point confidence level for each target in the room from the viewpoint of C, which ranks the importance of the target. With the importance of the targets ranked by each camera, we then calculate the focal point of all participants. The focus F for a target t is the accumulated focal point confidence over all cameras, since the more cameras that consider t to be the focal point, the higher this score should be for t:

F(t) = \sum_{q=1}^{7} FocalPointConfidence(C_q, t)    (3)

This score and the extracted amplitude of the individual headsets are used as input features to train our LSTM to identify the joint attention of the room and generate the edit of the video, which we detail in the next section.

4 LSTM Model and Results

Our neural network is composed of an input layer, an output layer, and 3 hidden LSTM layers with 100 neurons each. It uses an MAE loss function and the Adam optimizer, and is trained for 1000 epochs. We choose LSTM layers as opposed to only fully connected layers because editing is a decision-making process that takes into account both immediate (e.g., who just moved or started talking) and long-term (e.g., who has been talking for the past few seconds) observations, as well as previous editing decisions.
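A minimal sketch of this architecture, assuming a Keras-style implementation, is given below. The stacked LSTM layers, unit count, loss, optimizer, and epoch count follow the description above; the sequence length, feature count, and output activation are illustrative assumptions rather than values from the paper.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN = 10      # assumed window of past seconds fed to the model
N_FEATURES = 10   # assumed per-second features (focus scores + audio amplitudes)
N_TARGETS = 6     # four participants, whiteboard, presentation

# Three stacked LSTM layers of 100 units each, as described in the text;
# the sigmoid output keeps each joint attention score in [0, 1] (assumption).
model = Sequential([
    LSTM(100, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),
    LSTM(100, return_sequences=True),
    LSTM(100),
    Dense(N_TARGETS, activation="sigmoid"),
])
model.compile(loss="mae", optimizer="adam")

# model.fit(X_train, y_train, epochs=1000)  # X: (samples, SEQ_LEN, N_FEATURES)
```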
The output is the joint attention score, between 0.0 and 1.0, of the 6 targets in the room: each of the four participants, the whiteboard, and the presentation. A higher score implies that the target is more likely to be the joint attention of the room.

The ground truth input to our model comes from a film expert who edited the three chosen meeting videos based on what they felt was the joint attention of the room; we used Table 1 to convert these edits into attention scores. For example, if the expert chose the Left camera at time t of the meeting, the ground truth score for participants A_t and C_t would be 1.0, while a score of 0.0 would be assigned to participants B_t and D_t, the whiteboard W_t, and the presentation P_t. An exponential decay function smooths out the scores before and after each shot.

We then design a simple editing program, which (1) chooses the Center camera when W_t or P_t has a high score; if not, (2) determines whether a single participant X scores over 0.5, in which case the program chooses the closeup on X; or (3) if the scores of the two participants on one side of the table are significantly higher than those of the two on the other side, chooses the medium shot that shows the two participants with the higher scores; otherwise, (4) the Center camera is chosen.
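The following sketch implements one possible reading of these four rules; it is illustrative only. The threshold for a "high" whiteboard/presentation score and the margin used for "significantly higher" are assumptions not specified above, and the mapping of participants to side cameras follows Table 1.

```python
def choose_camera(scores, high=0.7, margin=0.3):
    """Pick a camera from per-target joint attention scores.

    `scores` maps targets to values in [0, 1]; keys assumed here are
    'A', 'B', 'C', 'D' (participants), 'W' (whiteboard), 'P' (presentation).
    `high` and `margin` are illustrative thresholds, not values from the paper.
    """
    # (1) Whiteboard or presentation clearly in focus -> Center camera.
    if scores["W"] >= high or scores["P"] >= high:
        return "Center"

    # (2) A single participant scoring over 0.5 -> closeup on that participant.
    leaders = [p for p in "ABCD" if scores[p] > 0.5]
    if len(leaders) == 1:
        return "CU." + leaders[0]

    # (3) One side of the table clearly ahead of the other -> the side camera
    #     that shows that pair (camera L shows A and C, camera R shows B and D;
    #     see Table 1).
    side_L, side_R = scores["A"] + scores["C"], scores["B"] + scores["D"]
    if side_L - side_R > margin:
        return "L"
    if side_R - side_L > margin:
        return "R"

    # (4) Otherwise fall back to the Center camera.
    return "Center"
```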
To provide a baseline for comparison, an audio-based edit is generated by selecting the closeup camera that shows the person with the highest microphone input.

Figure 2: A comparison of the audio-based and LSTM-based edits to the expert version of one clip, in terms of camera selection. Green indicates the same camera, yellow indicates partial targets or temporal displacement, orange an acceptable alternative, and red a complete miss. In the clip (from meeting IS1008d), A and C are debating the microphone feature of their product.

Figure 2 shows one clip of the audio-based and LSTM-learned joint-attention-based edits, compared with the expert edit. We find that the audio-based edit often correctly shows who is talking and, in a simple scenario, has high accuracy. However, its output is jittery and subject to noise in the audio data, resulting in more errors. It also chooses only closeup shots, which is not suitable for situations where multiple participants are in a discussion. In contrast, the LSTM joint attention approach switches cameras in a more timely fashion, in many cases capturing both the action and the essential reactions of the meetings. It also identifies interchanges or discussions between multiple participants more accurately than the audio-based approach, and uses the central camera to capture these exchanges. The LSTM-learned joint attention produces an edit comparable to the human one, even with the limited amount of training data and no smoothing.

5 Conclusion

This work presents the idea and an initial formulation of joint attention, using expert-edited meeting videos to form a baseline corpus. With more advanced pose-detection algorithms and high-quality video data, we envision this work having real-time editing applications using pre-trained LSTM models. Beyond basic criteria of pacing or shot similarity, we also hope to establish a baseline for future work on automated editing systems in various scenarios such as classrooms, film, and the performing arts. In conclusion, we have introduced a joint attention model for an automated editing tool based on an LSTM. Our generated output was evaluated against a baseline audio edit and compared to the expert edit, and it performed reasonably well in terms of camera selection and pacing.

References

[APS+14] Ido Arev, Hyun Soo Park, Yaser Sheikh, Jessica Hodgins, and Ariel Shamir. Automatic editing of footage from multiple social cameras. ACM Trans. Graph., 33(4):81:1–81:11, July 2014.

[BLA12] Floraine Berthouzoz, Wilmot Li, and Maneesh Agrawala. Tools for placing cuts and transitions in interview video. ACM Trans. Graph., 31(4):67:1–67:8, July 2012.

[Bor85] David Bordwell. Narrative in the Fiction Film. University of Wisconsin Press, 1985.

[CSWS17] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7291–7299, Honolulu, Hawaii, USA, 2017. IEEE.

[GRLC15] Quentin Galvane, Rémi Ronfard, Christophe Lino, and Marc Christie. Continuity editing for 3D animation. In AAAI Conference on Artificial Intelligence, pages 753–761, Austin, Texas, USA, January 2015. AAAI Press.

[Hea05] Jane Heal. Joint Attention: Communication and Other Minds: Issues in Philosophy and Psychology. Oxford University Press, 2005.

[JY11] Arnav Jhala and R. Michael Young. Intelligent machinima generation for visual storytelling, pages 151–170. Springer, New York, NY, 2011.

[LDTA17] Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. Computational video editing for dialogue-driven scenes. In Proceedings of SIGGRAPH 2017, 2017.

[LG15] Yong Jae Lee and Kristen Grauman. Predicting important objects for egocentric video summarization. International Journal of Computer Vision, 114(1):38–55, August 2015.

[LRGC01] Qiong Liu, Yong Rui, Anoop Gupta, and JJ Cadiz. Automating camera management for lecture room environments. In Proceedings of SIGCHI '01, volume 3, pages 442–449, March 2001.

[MCK+05] Iain McCowan, J. Carletta, Wessel Kraaij, Simone Ashby, S. Bourban, M. Flynn, M. Guillemot, Thomas Hain, J. Kadlec, Vasilis Karaiskos, M. Kronenthal, Guillaume Lathoud, Mike Lincoln, Agnes Lisowska Masson, Wilfried Post, D. Reidsma, and P. Wellner. The AMI meeting corpus. In International Conference on Methods and Techniques in Behavioral Research, Wageningen, Netherlands, 2005. Noldus Information Technology.

[MLZL02] Yu-Fei Ma, Lie Lu, Hong-Jiang Zhang, and Mingjing Li. A user attention model for video summarization. In Proceedings of the Tenth ACM International Conference on Multimedia, MULTIMEDIA '02, pages 533–542, New York, NY, USA, 2002. ACM.

[ONO04] M. Ozeki, Y. Nakamura, and Y. Ohta. Video editing based on behaviors-for-attention: an approach to professional editing using a simple scheme. In 2004 IEEE International Conference on Multimedia and Expo (ICME), volume 3, pages 2215–2218, June 2004.

[RBB08] Abhishek Ranjan, Jeremy Birnholtz, and Ravin Balakrishnan. Improving meeting capture by applying television production principles with audio and motion detection. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08, pages 227–236, New York, NY, USA, 2008. ACM.

[Smi12] Tim J. Smith. The attentional theory of cinematic continuity. Projections, 6(1):1–27, 2012.

[WC15] Hui-Yin Wu and Marc Christie. Stylistic patterns for generating cinematographic sequences. In 4th Workshop on Intelligent Cinematography and Editing, co-located with Eurographics 2015, pages 47–53, Zurich, Switzerland, May 2015. Eurographics Association.

[ZCSG16] Ke Zhang, Wei-Lun Chao, Fei Sha, and Kristen Grauman. Video summarization with long short-term memory. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 766–782, Cham, 2016. Springer International Publishing.