<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Joint Attention Model for Automated Editing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hui-Yin Wu</string-name>
          <email>huiyinwu@ncsu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arnav Jhala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>North Carolina State University</institution>
        </aff>
      </contrib-group>
      <permissions>
        <copyright-statement>Copyright © by H.-Y. Wu, A. Jhala. Copying permitted for private and academic purposes. In: H. Wu, M. Si, A. Jhala (eds.): Proceedings of the Joint Workshop on Intelligent Narrative Technologies and Workshop on Intelligent Cinematography and Editing, Edmonton, Canada, 11-2018.</copyright-statement>
      </permissions>
      <abstract>
        <p>We introduce a model of joint attention for the task of automatically editing video recordings of corporate meetings. In a multi-camera setting, we extract pose data of participants from each frame, and audio amplitude from individual headsets. These are used as features to train a system to predict the importance of each camera. Editing decisions and rhythm are learned from a corpus of videos edited by human experts. A Long Short-Term Memory (LSTM) neural network is trained to predict joint attention from the video and audio features of the expert-edited videos, and editing predictions are made on held-out test data. The output of the system is an editing plan for the meeting in Edit Decision List (EDL) format.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Our approach consists of three steps:
1. detecting head pose and audio amplitude from each camera,
2. training an LSTM to predict joint attention from these features, using expert-edited videos as ground truth, and
3. generating an edit of the meeting using the score output by the LSTM.</p>
      <p>The extraction of the head pose and audio data, and the training of the model, are pre-processed, while the final editing can be done in real time. We envision that the automated editing techniques built in this work could be broadly applicable to intelligent camera systems for both productivity and entertainment.</p>
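      <p>Since the system's output is an editing plan in EDL format, the following minimal Python sketch shows what emitting such a plan could look like. It assumes a CMX3600-style cut list; the paper does not show its exact EDL dialect, and the reel-name column here simply reuses our camera labels.</p>
      <preformat>
def write_edl(cuts, fps=25, title="meeting_edit"):
    """Emit a minimal CMX3600-style EDL from (camera, start_frame, end_frame)
    tuples. A sketch of the output stage only, under the assumptions above."""
    def timecode(frame):
        # frames -> HH:MM:SS:FF at the given frame rate
        seconds, frames = divmod(frame, fps)
        minutes, seconds = divmod(seconds, 60)
        hours, minutes = divmod(minutes, 60)
        return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"

    lines = [f"TITLE: {title}", "FCM: NON-DROP FRAME", ""]
    for i, (camera, start, end) in enumerate(cuts, start=1):
        src = f"{timecode(start)} {timecode(end)}"
        # event number, reel, track (V)ideo, (C)ut, source in/out, record in/out
        lines.append(f"{i:03d}  {camera:8} V     C        {src} {src}")
    return "\n".join(lines)

# e.g. write_edl([("CENTER", 0, 305), ("CU_A", 305, 495)])
      </preformat>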
      <p>The rest of the paper first covers the related work in the field. After providing an overview, we introduce the calculation of head pose, and the technical details of the LSTM. Finally, we discuss limitations and future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>In HCI, sound and motion have been used as metrics in automated camera capture systems for editing meeting or lecture videos [BLA12, RBB08, LRGC01]. Arev et al. [APS+14] were the first to propose using joint attention in social cameras (i.e. cameras carried around by people during a group activity). Editing tools that take cinematographic rules into consideration are also emerging. Ozeki et al. [ONO04] created a system that generates attention-based edits of cooking shows by detecting speech cues and gestures. Notably, Leake et al. [LDTA17] focus on dialogue scenes and provide a fast approach that evaluates film idioms and selects a camera for each line of dialogue.</p>
      <p>Video abstraction and summarization techniques generally focus on the problem of selecting keyframes or identifying scene sections for a compact representation of a video. Ma et al. [MLZL02] were the first to propose a joint attention model for video summarization comprising audio, movement, and textual information. Lee et al. [LG15] generated storyboards of daily tasks recorded on a wearable camera based on gaze, hand proximity, and frequency of appearance of an object. LSTMs have been used for video summarization tasks [ZCSG16] due to their ability to model long-range dependencies between variables, beyond a single frame.</p>
      <p>Virtual cinematography has been a prominent area of study in graphics, where the challenge is often how to place cameras for story and navigation. Jhala and Young designed the Darshak system, which generates a sequence of shots that fulfills goals of showing specific story actions and events [JY11]. Common editing rules [GRLC15] and film editing patterns [WC15] have been applied to virtual camera systems to ensure spatial and temporal continuity and to define common film idioms.</p>
      <p>Our work differs from virtual cinematography in two main ways: (1) in a 3D environment, object locations and camera parameters are precise, whereas in our video data, camera parameters and the positions of people and objects can only be estimated, and (2) the virtual camera can be moved anywhere, while the cameras in the meeting recordings are fixed.</p>
    </sec>
    <sec id="sec-3">
      <title>Joint Attention Model</title>
      <p>Figure 1: (a) Configuration of the meeting room and cameras. The presentation and whiteboard are situated at the front of the room, visible from the center camera. (b) Participant A's focus (indicated by arrow color) based on A's head pose and the configuration of the meeting room. If two or more targets are close, there can be ambiguity as to what A is looking at.</p>
      <p>We present the joint attention model from which we obtain features to train an LSTM. Meetings are selected from the videos in the AMI Corpus [MCK+05]: IDs IS1000a, IS1008d, and IS1009d, representing 3 types of interactions (brainstorming, status update, and project planning). Each meeting is around 25 minutes long.</p>
      <p>These meetings were held with 4 people in a smart meeting room equipped with 7 cameras and individual microphone headsets. The configuration of the room is shown in Figure 1a. The 7 cameras include 1 overview camera, 2 side cameras, and 4 closeups, one per participant. To extract the head pose of people in the video, we use the OpenPose library's [CSWS17] MPI 15-keypoint detection, which has five points on the head: 2 ears, 2 eyes, and the nose tip. Pose is calculated per frame, at 25 fps. For each point, the on-screen (x, y) coordinates are given, with a confidence score between 0 and 1.</p>
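      <p>For reference, OpenPose can write its per-frame detections as JSON; a minimal sketch of reading the five head-point confidences might look as follows. The keypoint indices here follow OpenPose's COCO-style ordering and are an assumption, as the exact layout depends on the model configuration.</p>
      <preformat>
import json

# Illustrative keypoint indices for the five head points (nose, eyes, ears);
# treat these as an assumption, not the paper's actual configuration.
HEAD_POINTS = {"nose": 0, "r_eye": 14, "l_eye": 15, "r_ear": 16, "l_ear": 17}

def head_confidences(frame_json_path):
    """Read one OpenPose per-frame JSON file and return, for every detected
    person, the confidence score of each head keypoint."""
    with open(frame_json_path) as f:
        frame = json.load(f)
    results = []
    for person in frame["people"]:
        kp = person["pose_keypoints_2d"]  # flat list: x0, y0, c0, x1, y1, c1, ...
        results.append({name: kp[3 * i + 2] for name, i in HEAD_POINTS.items()})
    return results
      </preformat>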
      <p>This information is assumed as input for the camera, which then calculates the confidence level of a head orientation based on the visibility of five facial points: the two eyes, the two ears, and the nose tip. The eight possible head orientations are (F)ront (i.e. looking in the same direction as the camera), (B)ehind (i.e. looking at the camera), (L)eft and (R)ight from the camera's perspective (with only the nose tip and one eye and ear visible), and variations of these: LF (left-front), RF (right-front), LB (left-behind), and RB (right-behind). The confidence level for a head orientation is the sum of the confidence score c of each of the n points p^v that should be visible, and the sum of 1 - c for the m points p^h that should be hidden:</p>
      <p>$\mathrm{PoseConfidence} = \sum_{i=1}^{n} c_{p^v_i} + \sum_{j=1}^{m} \left(1 - c_{p^h_j}\right)$ (1)</p>
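      <p>A minimal Python reading of Equation (1), building on the head_confidences sketch above; the visible/hidden point sets per orientation are an assumption for illustration (for L, which eye and ear remain visible depends on the convention chosen):</p>
      <preformat>
# Visible/hidden head-point sets per orientation. Only the "L" entry is
# spelled out; the remaining seven follow the rules described above.
ORIENTATION_POINTS = {
    "L": {"visible": ("nose", "l_eye", "l_ear"),
          "hidden": ("r_eye", "r_ear")},
    # ... the remaining orientations (R, F, B, LF, RF, LB, RB)
}

def pose_confidence(conf, orientation):
    """Equation (1): sum of c over the points that should be visible,
    plus the sum of (1 - c) over the points that should be hidden."""
    spec = ORIENTATION_POINTS[orientation]
    visible = sum(conf[p] for p in spec["visible"])
    hidden = sum(1.0 - conf[p] for p in spec["hidden"])
    return visible + hidden
      </preformat>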
      <p>The pose is then mapped to focal points corresponding to the head orientation, such as the example shown in Figure 1b for participant A. Each pose can refer to one, none, or more than one focal point. The focal point matrix (Table 1) shows the mapping from head orientation, per camera, to the corresponding focal point(s).</p>
      <p>Table 1: The focal point matrix. Rows are the six non-center cameras (L and R, which show the participant pairs AC and BD respectively, and the closeups CU.A-CU.D on participants A-D); columns are the eight head orientations (L, R, LF, RF, F, LB, RB, B); each cell lists the focal point(s), among the participants A-D, the presentation (P), and the whiteboard (W), that the orientation maps to for that camera.</p>
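      <p>For the computations below, Table 1 can be encoded per camera as a mapping from head orientation to focal point set. A minimal fragment is sketched here; the two cells shown follow Table 1's row for camera L, and all other cells are omitted.</p>
      <preformat>
# Per-camera encoding of Table 1: head orientation -> set of focal points.
# Looking into camera L ("B") means looking toward participants B and D;
# looking left-behind ("LB") means looking toward the presentation/whiteboard.
FOCAL_MAP = {
    "L": {
        "B": {"B", "D"},
        "LB": {"P", "W"},
        # ... other orientations for camera L
    },
    # ... cameras R, CU.A, CU.B, CU.C, CU.D
}
      </preformat>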
      <p>The confidence level that the focal point is a specific target (a person or object) t for a camera C is the average of the confidence levels for the x head orientations o that have the target as a focal point; this value is summed over each person p detected by the camera:</p>
      <p>$\mathrm{FocalPointConfidence}(C, t) = \sum_{p=1}^{n} \frac{\sum_{o=1}^{x} c_{o,p}(t)}{x}$ (2)</p>
      <p>This gives a focal point confidence level for each target in the room from the viewpoint of C, which ranks the importance of the targets. With the importance of the targets ranked by each camera, we then calculate the focal point of all participants. The focus F for a target t is the focal point confidence accumulated over all cameras, since the more cameras that consider t to be the focal point, the higher this score should be for t:</p>
      <p>$F(t) = \sum_{q=1}^{7} \mathrm{FocalPointConfidence}(C_q, t)$ (3)</p>
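      <p>Putting Equations (2) and (3) together, a minimal Python sketch, reusing pose_confidence and FOCAL_MAP from the earlier fragments (the data structures are assumptions, not the paper's implementation):</p>
      <preformat>
def focal_point_confidence(people_confs, target, focal_map):
    """Equation (2): for one camera, average pose_confidence over the x head
    orientations whose focal point set contains `target`, then sum that
    average over every person the camera detects."""
    orientations = [o for o, targets in focal_map.items() if target in targets]
    total = 0.0
    for conf in people_confs:  # one head-point confidence dict per person
        scores = [pose_confidence(conf, o) for o in orientations]
        if scores:
            total += sum(scores) / len(scores)
    return total

def focus(cameras, target):
    """Equation (3): accumulate the focal point confidence for `target`
    over all cameras. `cameras` maps a camera name to the head-point
    confidences of the people it detects."""
    return sum(
        focal_point_confidence(people, target, FOCAL_MAP[cam])
        for cam, people in cameras.items()
    )
      </preformat>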
      <p>This score and the extracted amplitude of the individual headsets are used as input features to train our LSTM to identify the joint attention of the room and generate the edit of the video, which we detail in the next section.</p>
    </sec>
    <sec id="sec-4">
      <title>LSTM</title>
    </sec>
    <sec id="sec-5">
      <title>Model and Results</title>
      <p>Our neural network is composed of an input layer, an output layer, and 3 hidden LSTM layers with 100 neurons each. It uses a mean absolute error (MAE) loss function and the Adam optimizer, and is trained for 1000 epochs. We choose LSTM layers over fully connected layers alone because editing is a decision-making process that takes into account both immediate observations (e.g. who just moved or started talking) and long-term observations (e.g. who has been talking for the past few seconds), as well as previous editing decisions. The output is the joint attention score, between 0.0 and 1.0, of 6 targets in the room: each of the four participants, the whiteboard, and the presentation. A higher score implies that the target is more likely the joint attention of the room.</p>
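      <p>The architecture translates directly into a few lines of, e.g., Keras; the framework, the sigmoid output activation (to keep scores in [0.0, 1.0]), and the input layout of six focus scores plus four headset amplitudes are assumptions, as the paper does not state them.</p>
      <preformat>
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 10  # assumed layout: 6 focus scores F(t) + 4 headset amplitudes
N_TARGETS = 6    # participants A-D, whiteboard, presentation

model = keras.Sequential([
    layers.Input(shape=(None, N_FEATURES)),        # sequence of per-frame features
    layers.LSTM(100, return_sequences=True),       # 3 hidden LSTM layers,
    layers.LSTM(100, return_sequences=True),       # 100 neurons each
    layers.LSTM(100, return_sequences=True),
    layers.Dense(N_TARGETS, activation="sigmoid"), # scores in [0.0, 1.0]
])
model.compile(optimizer="adam", loss="mae")
# model.fit(features, attention_labels, epochs=1000)
      </preformat>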
      <p>The ground truth input to our model comes from a film expert who edited the three chosen meeting videos based on what they felt was the joint attention of the room; we used Table 1 to convert this into attention scores. For example, if the expert chose the Left camera at time t of the meeting, the ground truth score label for participants At and Ct would be 1.0, while the score 0.0 would be assigned to participants Bt and Dt, the whiteboard Wt, and the presentation Pt. An exponential decay function smooths out the scores before and after the shot.</p>
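      <p>One simple way to realize this smoothing is a forward and a backward pass with a per-frame decay factor; the factor here is an assumed value, as the paper does not report it.</p>
      <preformat>
import numpy as np

def smooth_labels(scores, decay=0.9):
    """Exponentially decay each 1.0 ground-truth score outward in time, so a
    target's label ramps up before its shot and fades after it."""
    out = np.asarray(scores, dtype=float).copy()
    for t in range(1, len(out)):            # fade after the shot
        out[t] = max(out[t], out[t - 1] * decay)
    for t in range(len(out) - 2, -1, -1):   # ramp up before the shot
        out[t] = max(out[t], out[t + 1] * decay)
    return out
      </preformat>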
      <p>We then design a simple editing program, which (1) chooses the Center camera when Wt or Pt has a high score; if not, (2) determines whether a single participant X scores over 0.5, in which case the program chooses the closeup on X; or (3) if the scores of the two participants on one side of the table are significantly higher than those of the two on the other side, chooses a medium shot that shows the two participants with the higher scores; otherwise, (4) the Center camera is chosen.</p>
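      <p>A sketch of this rule cascade in Python; the threshold for rule (1) and the significance margin for rule (3) are assumed values, as only the 0.5 cutoff of rule (2) is stated above.</p>
      <preformat>
def choose_camera(scores, threshold=0.5, margin=0.3):
    """Rule-based shot selection from joint attention scores.
    `scores` maps the targets {"A", "B", "C", "D", "W", "P"} to [0.0, 1.0]."""
    # (1) whiteboard or presentation in focus: Center camera
    if scores["W"] > threshold or scores["P"] > threshold:
        return "Center"
    # (2) a single dominant participant: that participant's closeup
    leaders = [p for p in "ABCD" if scores[p] > threshold]
    if len(leaders) == 1:
        return f"CU.{leaders[0]}"
    # (3) one side of the table clearly dominant: that side's medium shot
    left, right = scores["A"] + scores["C"], scores["B"] + scores["D"]
    if left - right > margin:
        return "Left"       # medium shot showing A and C
    if right - left > margin:
        return "Right"      # medium shot showing B and D
    # (4) otherwise fall back to the Center camera
    return "Center"
      </preformat>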
      <p>To provide a baseline for comparison, an audio-based edit is generated by selecting the closeup camera that shows the person with the highest microphone input. Figure 2 shows one clip of the audio-based edit and of the edit based on the LSTM-learned joint attention score, compared with the expert edit.</p>
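      <p>The baseline itself reduces to a one-line selection rule, sketched here for completeness (the function name and data layout are illustrative):</p>
      <preformat>
def audio_baseline_camera(amplitudes):
    """Baseline edit: always cut to the closeup of whoever's headset is
    currently loudest. `amplitudes` maps participants "A".."D" to the
    microphone amplitude at the current frame."""
    loudest = max(amplitudes, key=amplitudes.get)
    return f"CU.{loudest}"
      </preformat>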
      <p>We find that the audio edit often correctly shows who is talking, and in a simple scenario it has high accuracy. However, the output is jittery and subject to noise in the audio data, resulting in more errors. It also chooses only closeup shots, which is not suitable for situations where multiple participants are in a discussion. In contrast, the LSTM joint attention approach switches cameras in a more timely fashion, in many cases capturing both the action and essential reactions of the meetings. It also identifies scenarios with interchanges or discussions between multiple participants more accurately than the audio-based approach, and uses the central camera to capture these exchanges. The LSTM-learned joint attention produces an edit comparable to the human one, even with the limited amount of training data and no smoothing.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This work presents the idea and an initial formulation of joint attention, using expert-edited meeting videos to form a baseline corpus. With more advanced pose-detection algorithms and high-quality video data, we envision real-time editing applications of this work using pre-trained LSTM models. Beyond basic criteria of pacing or shot similarity, we also hope to establish a baseline for future work on automated editing systems in various scenarios such as classrooms, film, and the performing arts.</p>
      <p>In conclusion, we have introduced a joint attention model for an automated editing tool based on an LSTM. Our generated output was evaluated against a baseline audio edit by comparison with the expert edit, and performed reasonably well in terms of camera selection and pacing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [APS+14]
          <string-name>
            <surname>Ido</surname>
            <given-names>Arev</given-names>
          </string-name>
          , Hyun Soo Park, Yaser Sheikh, Jessica Hodgins, and
          <string-name>
            <given-names>Ariel</given-names>
            <surname>Shamir</surname>
          </string-name>
          .
          <article-title>Automatic editing of footage from multiple social cameras</article-title>
          .
          <source>ACM Trans. Graph</source>
          .,
          <volume>33</volume>
          (
          <issue>4</issue>
          ):
          <volume>81</volume>
          :1{
          <fpage>81</fpage>
          :
          <fpage>11</fpage>
          ,
          <year>July 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [BLA12]
          <string-name>
            <given-names>Floraine</given-names>
            <surname>Berthouzoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wilmot</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and Maneesh</given-names>
            <surname>Agrawala</surname>
          </string-name>
          .
          <article-title>Tools for placing cuts and transitions in interview video</article-title>
          .
          <source>ACM Trans. Graph</source>
          .,
          <volume>31</volume>
          (
          <issue>4</issue>
          ):
          <volume>67</volume>
          :1{
          <issue>67</issue>
          :
          <fpage>8</fpage>
          ,
          <string-name>
            <surname>July</surname>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bor85]
          <string-name>
            <given-names>David</given-names>
            <surname>Bordwell</surname>
          </string-name>
          .
          <article-title>Narrative in the Fiction Film</article-title>
          . University of Wisconsin Press,
          <year>1985</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [CSWS17]
          <string-name>
            <given-names>Zhe</given-names>
            <surname>Cao</surname>
          </string-name>
          , Tomas Simon,
          <string-name>
            <surname>Shih-En Wei</surname>
            , and
            <given-names>Yaser</given-names>
          </string-name>
          <string-name>
            <surname>Sheikh</surname>
          </string-name>
          .
          <article-title>Realtime multi-person 2d pose estimation using part a nity elds</article-title>
          .
          <source>In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , pages
          <fpage>7291</fpage>
          {
          <fpage>7299</fpage>
          ,
          <string-name>
            <surname>Honolulu</surname>
          </string-name>
          , Hawaii, USA,
          <year>2017</year>
          . IEEE Xplore.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [GRLC15]
          <string-name>
            <given-names>Quentin</given-names>
            <surname>Galvane</surname>
          </string-name>
          , Remi Ronfard, Christophe Lino, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Christie</surname>
          </string-name>
          .
          <article-title>Continuity Editing for 3D Animation</article-title>
          .
          <source>In AAAI Conference on Arti cial Intelligence</source>
          , pages
          <fpage>753</fpage>
          {
          <fpage>761</fpage>
          ,
          <string-name>
            <surname>Austin</surname>
          </string-name>
          , Texas, United States,
          <year>January 2015</year>
          . AAAI Press.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          Oxford University Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Arnav</given-names>
            <surname>Jhala</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. Michael</given-names>
            <surname>Young</surname>
          </string-name>
          .
          <article-title>Intelligent Machinima Generation for Visual Storytelling</article-title>
          , pages
          <volume>151</volume>
          {
          <fpage>170</fpage>
          . Springer New York, New York, NY,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Ozeki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nakamura</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ohta</surname>
          </string-name>
          .
          <article-title>Video editing based on behaviors-for-attention - an approach to professional editing using a simple scheme</article-title>
          .
          <source>In 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763)</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>2215</fpage>
          <lpage>{</lpage>
          2218 Vol.
          <volume>3</volume>
          ,
          <string-name>
            <surname>June</surname>
          </string-name>
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Abhishek</given-names>
            <surname>Ranjan</surname>
          </string-name>
          , Jeremy Birnholtz, and
          <string-name>
            <given-names>Ravin</given-names>
            <surname>Balakrishnan</surname>
          </string-name>
          .
          <article-title>Improving meeting capture by applying television production principles with audio and motion detection</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '08</source>
          , pages
          <fpage>227</fpage>
          {
          <fpage>236</fpage>
          , New York, NY, USA,
          <year>2008</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Tim J.</given-names>
            <surname>Smith.</surname>
          </string-name>
          <article-title>The attentional theory of cinematic continuity</article-title>
          .
          <source>Projections</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>27</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Hui-Yin Wu</surname>
            and
            <given-names>Marc</given-names>
          </string-name>
          <string-name>
            <surname>Christie</surname>
          </string-name>
          .
          <article-title>Stylistic Patterns for Generating Cinematographic Sequences</article-title>
          .
          <source>In 4th Workshop on Intelligent Cinematography and Editing Co-Located w/ Eurographics</source>
          <year>2015</year>
          , pages
          <fpage>47</fpage>
          {
          <fpage>53</fpage>
          ,
          <string-name>
            <surname>Zurich</surname>
          </string-name>
          , Switzerland, May
          <year>2015</year>
          . Eurographics Association. The de nitive version is available at http://diglib.eg.org/.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [ZCSG16]
          <string-name>
            <given-names>Ke</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Wei-Lun</surname>
            <given-names>Chao</given-names>
          </string-name>
          , Fei Sha, and
          <string-name>
            <given-names>Kristen</given-names>
            <surname>Grauman</surname>
          </string-name>
          .
          <article-title>Video summarization with long shortterm memory</article-title>
          . In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors,
          <source>Computer Vision { ECCV</source>
          <year>2016</year>
          , pages
          <fpage>766</fpage>
          {
          <fpage>782</fpage>
          ,
          <string-name>
            <surname>Cham</surname>
          </string-name>
          ,
          <year>2016</year>
          . Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Yong</given-names>
            <surname>Jae Lee</surname>
          </string-name>
          and
          <string-name>
            <given-names>Kristen</given-names>
            <surname>Grauman</surname>
          </string-name>
          .
          <article-title>Predicting important objects for egocentric video summarization</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>114</volume>
          (
          <issue>1</issue>
          ):
          <volume>38</volume>
          {
          <fpage>55</fpage>
          ,
          <string-name>
            <surname>Aug</surname>
          </string-name>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [LRGC01] Qiong Liu, Yong Rui, Anoop Gupta, and
          <string-name>
            <given-names>JJ</given-names>
            <surname>Cadiz</surname>
          </string-name>
          .
          <article-title>Automating camera management for lecture room environments</article-title>
          .
          <source>In Proceedings of SIGCHI'01</source>
          , volume
          <volume>3</volume>
          , pages
          <fpage>442</fpage>
          {
          <fpage>449</fpage>
          ,
          <string-name>
            <surname>March</surname>
          </string-name>
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [MCK+05]
          <string-name>
            <given-names>Iain</given-names>
            <surname>Mccowan</surname>
          </string-name>
          ,
          <string-name>
            <surname>J Carletta</surname>
          </string-name>
          , Wessel Kraaij, Simone Ashby,
          <string-name>
            <given-names>S</given-names>
            <surname>Bourban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Flynn</surname>
          </string-name>
          ,
          <string-name>
            <surname>M Guillemot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J</given-names>
            <surname>Kadlec</surname>
          </string-name>
          , Vasilis Karaiskos,
          <string-name>
            <surname>M Kronenthal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Guillaume</given-names>
            <surname>Lathoud</surname>
          </string-name>
          , Mike Lincoln, Agnes Lisowska Masson, Wilfried Post,
          <string-name>
            <given-names>D</given-names>
            <surname>Reidsma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P</given-names>
            <surname>Wellner</surname>
          </string-name>
          .
          <article-title>The ami meeting corpus</article-title>
          .
          <source>In International Conference on Methods and Techniques in Behavioral Research</source>
          , page
          <volume>702</volume>
          pp.,
          <string-name>
            <surname>Wageningen</surname>
          </string-name>
          , Netherlands,
          <volume>01</volume>
          <fpage>2005</fpage>
          . Wageningen: Noldus Information Technology.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [MLZL02]
          <string-name>
            <surname>Yu-Fei</surname>
            <given-names>Ma</given-names>
          </string-name>
          , Lie Lu,
          <string-name>
            <surname>Hong-Jiang Zhang</surname>
            , and
            <given-names>Mingjing</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>A user attention model for video summarization</article-title>
          .
          <source>In Proceedings of the Tenth ACM International Conference on Multimedia, MULTIMEDIA '02</source>
          , pages
          <fpage>533</fpage>
          {
          <fpage>542</fpage>
          , New York, NY, USA,
          <year>2002</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Hea05] [JY11] [LG15] [ONO04] [RBB08] [Smi12] [WC15] [LDTA17] Mackenzie Leake</source>
          , Abe Davis,
          <string-name>
            <given-names>Anh</given-names>
            <surname>Truong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Maneesh</given-names>
            <surname>Agrawala</surname>
          </string-name>
          .
          <article-title>Computational video editing for dialogue-driven scenes</article-title>
          .
          <source>In Proceedings of SIGGRAPH</source>
          <year>2017</year>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>