<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automated Analysis of the Framing of Faces in a Large Video Archive</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Graeme Phillipson</string-name>
          <email>graeme.phillipson@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ronan Forman</string-name>
          <email>ronan.forman@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mark Woosey</string-name>
          <email>mark.woosey@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Wright</string-name>
          <email>craig.wright@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Evans</string-name>
          <email>michael.evans@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephen Jolly</string-name>
          <email>stephen.jolly@bbc.co.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>BBC Research &amp; Development</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automated editing systems require an understanding of how subjects are typically framed, and how framing in one shot relates to another. In this paper we present an automated analysis of the framing of faces within a large video archive. These results demonstrate that the rule of thirds alone is insufficient to describe framing that is typical in drama, and we show that the framing of one shot has an effect on that of the next.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Automated editing systems [LC12][GRG14][MBC14][GRLC15][LDTA17] could enable broadcasters to provide
coverage of more live events (such as music and arts festivals) where the cost of additional outside broadcast
units would be prohibitive [WAC+18]. Constructing such systems requires an understanding of how to frame
and sequence video. To frame video, systems often apply the rule of thirds, aligning faces on the dividing lines
between the vertical and horizontal thirds [LC12][ST11]. More sophisticated approaches have been used, but
these require large amounts of manually annotated data [SC14]. There is empirical evidence for the validity of
the rule of thirds. However, this evidence also suggests that the rule does not fully explain how faces are framed
[Cut15][WGLC17]. Additionally, it does not describe how framing in one shot relates to the next. In this paper
we present an initial automated analysis of a large quantity of archive data, in contrast to previous investigations
relying on human annotation. Manually-annotated data is assumed to be of a higher quality, and offers greater
flexibility in what can be annotated. However, automated annotation is scalable to larger quantities of data,
which may allow for more precise quantitative measures.
Copyright © by G. Phillipson, R. Forman, M. Woosey, C. Wright, M. Evans, S. Jolly. Copying permitted for private and academic
purposes.
[Fig. 1: probability distribution of shot lengths in seconds, 0 to 20 s.]</p>
      <p>2018 in 16:9 aspect ratio. Each was conformed to a resolution of 1024×576 before analysis. The first and last
5 minutes were trimmed from each show to remove trailers and title/credit sequences that may contain faces.
Those faces might otherwise be found many times in the dataset and bias the results. The videos were split
into discrete shots with ffmpeg1. The middle frame of each shot was extracted and assumed to be representative
of the shot as a whole. We have not considered developing or action shots in this analysis, and they must be
assumed to add some noise to the overall results. Shots shorter than 0.5s and longer than 20s were filtered
out, as they are likely to be the result of either false positive or negative shot change detections, or of shots framed
with subjects other than static faces in mind. The locations of the faces and their landmarks (e.g. the eyes) were
found using the SeetaFace library [LKW+16]. SeetaFace was chosen because its accuracy had been validated on
this archive [IRF], which is important as not all off-the-shelf computer vision techniques generalise well enough to
work across such a large archive. It is worth noting that SeetaFace will not detect partial faces, so we would not
expect detections towards the very edge of the screen, where part of the face may be outside the visible frame.
The centre of the face was taken to be the midpoint between the eyes. 3,567,433 faces were found in total.</p>
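      <p>The pre-processing steps above can be sketched as follows. This is a minimal illustration with hypothetical shot boundaries and frame rate; the real pipeline used ffmpeg for shot detection and SeetaFace for landmarks, neither of which is reproduced here.

```python
def middle_frame(start, end):
    """Frame index assumed representative of the shot as a whole."""
    return (start + end) // 2

def face_centre(left_eye, right_eye):
    """Face centre taken as the midpoint between the eyes."""
    return ((left_eye[0] + right_eye[0]) / 2,
            (left_eye[1] + right_eye[1]) / 2)

def filter_shots(shots, fps=25.0, min_s=0.5, max_s=20.0):
    """Drop shots shorter than 0.5 s or longer than 20 s (likely
    shot-change detection errors, or not framed around static faces)."""
    kept = []
    for start, end in shots:  # (start_frame, end_frame) pairs
        if min_s <= (end - start) / fps <= max_s:
            kept.append((start, end, middle_frame(start, end)))
    return kept

# At 25 fps, a 0.4 s shot and a 37.6 s shot are dropped; the 2 s shot is kept.
print(filter_shots([(0, 10), (10, 60), (60, 1000)]))  # [(10, 60, 35)]
print(face_centre((430, 180), (470, 180)))            # (450.0, 180.0)
```
</p>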
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>Shot distribution
The probability of different shot lengths can be seen in Fig.1. The mean shot length was 3.975s. The distribution
shows a preference for shorter shots in most of the archive.</p>
      <p>Head Position In All Shots
In Fig.2, the probability distribution of faces occurring at different locations within a shot is estimated across
all the shots using Kernel Density Estimation [Sco15]. The vertical distribution shows a clear preference for the
face to occur on the upper third line. The horizontal distribution shows a preference for faces to be within the
middle third, particularly just inside the thirds lines, with a small preference for being on the left.</p>
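      <p>The density estimate above can be illustrated with SciPy's gaussian_kde. This is an assumption: the paper cites [Sco15] for KDE but does not name an implementation, and the face positions below are synthetic, placed near the thirds lines of a 1024×576 frame.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic face centres (x, y): two clusters near the vertical thirds
# lines (x ~ 341 and x ~ 683) on the upper third line (y ~ 192).
rng = np.random.default_rng(0)
faces = np.vstack([
    rng.normal([341.0, 192.0], [40.0, 20.0], size=(500, 2)),
    rng.normal([683.0, 192.0], [40.0, 20.0], size=(500, 2)),
])

# gaussian_kde expects data with shape (n_dims, n_samples).
kde = gaussian_kde(faces.T)

# Evaluate the density along the upper third line (y = 192).
xs = np.linspace(0.0, 1024.0, 256)
grid = np.vstack([xs, np.full_like(xs, 192.0)])
density = kde(grid)

# The density peaks near a thirds line rather than the frame centre.
peak_x = xs[np.argmax(density)]
print(peak_x)
```
</p>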
      <p>Head Position for Shots with Different Numbers of People
In Fig.3a the frequency of occurrence of faces in shots containing only one person is shown. There is a preference
for the middle upper third line, with two clusters at either end of this. There is also an asymmetric cluster to
the right of and below the main cluster. Manual inspection of the shots responsible for this cluster shows that it
is due to the presence of an overlaid sign language interpreter in a proportion of these shows, whose face is
in approximately the same location in all of them. Fig.3b shows the same distribution for shots with two people
in them. Here the average framing is slightly higher, and the two main clusters are spaced further apart.
1 https://www.ffmpeg.org/
[Fig. 2: (a) the density of faces across the horizontal axis; (b) the density of faces across the vertical axis.]</p>
      <p>Relationships Between Consecutive Shots
The relative framing of faces in two consecutive shots (where both shots contain only a single face) is illustrated
in Fig.4. Given a face in a particular horizontal position (the x-axis) on the upper third line, the distribution
of horizontal positions of faces in subsequent shots is as shown on the y-axis. For example, given a face located
at 400px horizontally in one shot, the most likely position for the face in the subsequent shot is around 600px.
This was calculated by storing all of the face detection locations in a KD-Tree [MM99], then walking a point
across the upper third line. For each location on the line, all face detections within 10px were retrieved and
the index of shots was used to find the locations of faces in the next shot. Kernel density estimation was then
used to produce the conditional probability distribution of horizontal location in the next shot given the current
horizontal position. The results show that when a face is in the left cluster it is likely that it will subsequently
appear in the right cluster, and vice versa.</p>
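      <p>The KD-Tree query described above can be sketched with SciPy's cKDTree. The data here are synthetic, alternating left/right framings as in shot/reverse-shot dialogue; the actual analysis indexed some 3.5M detections and then applied KDE to the retrieved next-shot positions.

```python
import numpy as np
from scipy.spatial import cKDTree

# x-position of the (single) face in shot i, for 100 consecutive shots.
shot_x = np.array([400.0, 600.0] * 50)
tree = cKDTree(shot_x[:, None])  # 1-D KD-tree over the x-positions

def next_shot_positions(x, radius=10.0):
    """x-positions of faces in the shot following each face within
    `radius` pixels of `x`, mirroring the walking-point query."""
    idx = tree.query_ball_point([x], r=radius)
    nxt = [i + 1 for i in idx if i + 1 < len(shot_x)]
    return shot_x[nxt]

# Faces framed at x = 400 are always followed by faces at x = 600 here.
print(set(next_shot_positions(400.0)))
```
</p>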
      <p>Distribution of Face Sizes
The distribution of face sizes was calculated by taking the face landmarks produced by SeetaFace (the eyes,
nose, and two corners of the mouth), finding the convex hull of these points [BDH96], and calculating the area
of this convex hull. The distribution of this area can be seen in Fig.5. Most productions use a semi-standardised
language to describe shots as being a "Close-up", "Mid-shot", "Long shot", etc. [ST11] These shot types are defined
in terms of where on the body the bottom of the screen cuts. If face area was strongly correlated with shot
type then there might be a multi-modal distribution which could be used to estimate shot type. However, in
Fig.5 we can see that while it is clearly not a single distribution, the overlap is too great to allow shot type
to be estimated from face area alone.
[Fig. 5: histogram of face count against face area.]</p>
      <p>The results show that while the rule of thirds is important, there are deviations from it (such as the most likely
face locations being slightly inside the lines for single shots, but on those lines for two-shots) which require large
datasets in order to quantify.</p>
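      <p>The face-size measure can be reproduced with SciPy's ConvexHull over the five landmarks; for 2-D input the hull's volume attribute is the enclosed area. The landmark coordinates below are illustrative, not from the archive.

```python
from scipy.spatial import ConvexHull

# Five SeetaFace-style landmarks: eyes, nose tip, and mouth corners.
landmarks = [
    (430.0, 180.0),  # left eye
    (470.0, 180.0),  # right eye
    (450.0, 205.0),  # nose tip (interior point, not a hull vertex)
    (435.0, 225.0),  # left mouth corner
    (465.0, 225.0),  # right mouth corner
]

hull = ConvexHull(landmarks)
# For 2-D points, ConvexHull.volume is the area of the hull polygon:
# here a trapezoid, (40 + 30) / 2 * 45 = 1575 square pixels.
print(hull.volume)  # 1575.0
```
</p>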
      <p>Previous work has shown that single shots have a single centrally-framed cluster [Cut15][WGLC17] rather than
the bimodal distribution demonstrated here. The bimodal distribution, combined with the oscillations shown
for the conditional probability of framing in consecutive shots, suggests extensive use of the shot/reverse-shot
pattern often used in dialogue [ST11]. The previous work concentrated on film, whereas here we are examining
television drama, and this difference in result may simply reflect how often the shot/reverse-shot pattern is used
in these different media.</p>
      <p>Expanding this work to analyse subjects other than faces is difficult, due to the lack of labelled data for this
dataset with which to validate models other than simple face location. In particular, it is important to validate models on
labelled data from this archive, as many open source systems were not trained on broadcast media. However, mass
data labelling services [PBSA17] may provide a way to produce enough labelled data to validate other methods.
This would allow visual features such as the framing of the whole body [RAG18][CSWS17][SJMS17][WRKS16]
or salient non-human objects [CBSC18] to be investigated. Pose estimation would allow for investigation of the
relationship between framing and the direction of gaze. Dense pose estimation [RAG18] might be
particularly useful, as shot types are normally discussed in terms of where on the body the bottom of the frame
cuts, which we would be able to calculate from this. Additionally, this would allow the detection of
people not facing the camera and, in turn, enable the detection of over-the-shoulder shots.</p>
      <p>[BDH96] C. Bradford Barber, David P. Dobkin, and Hannu Huhdanpaa. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software, 22(4):469-483, 1996.</p>
      <p>[CBSC18] Marcella Cornia, Lorenzo Baraldi, Giuseppe Serra, and Rita Cucchiara. Predicting Human Eye Fixations via an LSTM-based Saliency Attentive Model. IEEE Transactions on Image Processing, 2018.</p>
      <p>[CSWS17] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.</p>
      <p>[Cut15] James E. Cutting. The framing of characters in popular movies. Art &amp; Perception, 3(2):191-212, 2015.</p>
      <p>[GRG14] Vineet Gandhi, Remi Ronfard, and Michael Gleicher. Multi-Clip Video Editing from a Single Viewpoint. In CVMP 2014 - European Conference on Visual Media Production, page Article No. 9, London, United Kingdom, November 2014. ACM.</p>
      <p>[GRLC15] Quentin Galvane, Remi Ronfard, Christophe Lino, and Marc Christie. Continuity Editing for 3D Animation. In AAAI Conference on Artificial Intelligence, pages 753-761, Austin, Texas, United States, January 2015. AAAI Press.</p>
      <p>[IRF] IRFS Weeknotes 243. Accessed: 2018-10-03.</p>
      <p>[LC12] Christophe Lino and Marc Christie. Efficient composition for virtual camera control. In Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA '12, pages 65-70, Goslar, Germany, 2012. Eurographics Association.</p>
      <p>[LDTA17] Mackenzie Leake, Abe Davis, Anh Truong, and Maneesh Agrawala. Computational video editing for dialogue-driven scenes. ACM Trans. Graph., 36(4):130:1-130:14, July 2017.</p>
      <p>[LKW+16] Xin Liu, Meina Kan, Wanglong Wu, Shiguang Shan, and Xilin Chen. VIPLFaceNet: An open source deep face recognition SDK. Frontiers of Computer Science, 2016.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [MM99]
          <string-name>
            <given-names>Songrit</given-names>
            <surname>Maneewongvatana</surname>
          </string-name>
          and
          <string-name>
            <given-names>David M.</given-names>
            <surname>Mount</surname>
          </string-name>
          .
          <article-title>Analysis of approximate nearest neighbor searching with clustered point sets</article-title>
          .
          <source>CoRR</source>
          , cs.CG/9901013,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [PBSA17]
          <string-name>
            <given-names>Eyal</given-names>
            <surname>Peer</surname>
          </string-name>
          , Laura Brandimarte, Sonam Samat, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Acquisti</surname>
          </string-name>
          .
          <article-title>Beyond the turk: Alternative platforms for crowdsourcing behavioral research</article-title>
          .
          <source>Journal of Experimental Social Psychology</source>
          ,
          <volume>70</volume>
          :
          <fpage>153</fpage>
          -
          <lpage>163</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [RAG18]
          <string-name>
            <given-names>Riza Alp</given-names>
            <surname>Guler</surname>
          </string-name>
          , Natalia Neverova, and
          <string-name>
            <given-names>Iasonas</given-names>
            <surname>Kokkinos</surname>
          </string-name>
          .
          <article-title>DensePose: Dense human pose estimation in the wild</article-title>
          . arXiv,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [SC14]
          <string-name>
            <given-names>Cunka</given-names>
            <surname>Sanokho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Christie</surname>
          </string-name>
          .
          <article-title>On-screen visual balance inspired by real movies</article-title>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Sco15]
          <string-name>
            <given-names>David W.</given-names>
            <surname>Scott</surname>
          </string-name>
          .
          <source>Multivariate Density Estimation: Theory, Practice, and Visualization, Second Edition</source>
          . Wiley,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [SJMS17]
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Simon</surname>
          </string-name>
          , Hanbyul Joo, Iain Matthews, and
          <string-name>
            <given-names>Yaser</given-names>
            <surname>Sheikh</surname>
          </string-name>
          .
          <article-title>Hand keypoint detection in single images using multiview bootstrapping</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [ST11]
          <string-name>
            <given-names>Roger</given-names>
            <surname>Singleton-Turner</surname>
          </string-name>
          .
          <source>Cue &amp; Cut</source>
          . Manchester University Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [WAC+18]
          <string-name>
            <given-names>Craig</given-names>
            <surname>Wright</surname>
          </string-name>
          , Jack Allnut, Rosie Campbell,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Evans</surname>
          </string-name>
          , Stephen Jolly, Lianne Kerlin, James Gibson, Graeme Phillipson, and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Shotton</surname>
          </string-name>
          .
          <article-title>Ai in production: Video analysis and machine learning for expanded live events coverage</article-title>
          .
          <source>Proceedings of the International Broadcasting Convention</source>
          ,
          <year>Sept 2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [WGLC17]
          <string-name>
            <given-names>Hui-Yin</given-names>
            <surname>Wu</surname>
          </string-name>
          , Quentin Galvane, Christophe Lino, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Christie</surname>
          </string-name>
          .
          <article-title>Analyzing elements of style in annotated film clips</article-title>
          .
          <source>In WICED 2017 - Eurographics Workshop on Intelligent Cinematography and Editing</source>
          , pages
          <fpage>29</fpage>
          -
          <lpage>35</lpage>
          , Lyon, France,
          <year>April 2017</year>
          . The Eurographics Association
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [WRKS16]
          <string-name>
            <given-names>Shih-En</given-names>
            <surname>Wei</surname>
          </string-name>
          , Varun Ramakrishna, Takeo Kanade, and
          <string-name>
            <given-names>Yaser</given-names>
            <surname>Sheikh</surname>
          </string-name>
          .
          <article-title>Convolutional pose machines</article-title>
          .
          <source>In CVPR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>