<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Annotation and Automated Extraction of Audio-Visual Staging Patterns in Large-Scale Empirical Film Studies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Henning Agt-Rickauer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Hentschel</string-name>
          <email>christian.hentschel@hpi.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <email>harald.sack@fiz-karlsruhe.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Karlsruhe Institute of Technology</institution>
          ,
          <addr-line>Karlsruhe</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Hasso Plattner Institute for IT Systems Engineering, University of Potsdam</institution>
          ,
          <addr-line>Potsdam</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The study of audio-visual rhetorics of affect scientifically analyses the impact
of auditory and visual staging patterns on the perception of media productions
as well as on the emotions they convey. In the AdA project (http://www.ada.cinepoetics.fu-berlin.de), together with film
scholars, we follow the hypothesis that TV reports draw on audio-visual
patterns from cinematographic productions in order to emotionally affect viewers, and we test it by
large-scale corpus analysis of TV reports, documentaries and genre films on the topos
"financial crisis". As can be observed in past and current media coverage
of the world-wide financial crisis, TV reports often employ highly
emotionalizing staging strategies in order to convey a certain message to the audience.
This is also true for public broadcasting agencies that claim to adhere to
journalistic objectivity. The project therefore aims to make this hard-to-grasp
opinion-forming level of audio-visual reporting transparent and follows the hypothesis
that audio-visual staging patterns always aim at coining emotional attitudes.</p>
      <p>
        So far, the localization and description of these patterns has been limited to
micro-studies due to the extremely high manual annotation effort involved [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
We therefore pursue two main objectives: 1) the creation of a standardized
annotation vocabulary to be applied for semantic annotations, and 2) the semi-automatic
classification of audio-visual patterns by training models on manually
assembled ground-truth annotation data. The annotation vocabulary for empirical
film studies and the semantic annotation of audio-visual material based on Linked
Open Data principles enable the publication, reuse, retrieval, and visualization
of the results of film-analytical methods. Furthermore, automatic analysis of video
streams speeds up the process of extracting audio-visual patterns.
      </p>
      <p>This paper focuses on the semantic data management of the
project and the vocabulary developed for fine-grained semantic video annotation.
Furthermore, we give a short outlook on how we integrate machine
learning into the process of automatically detecting audio-visual patterns.</p>
      <sec id="sec-1-1">
        <title>3 AdA-project | http://www.ada.cinepoetics.fu-berlin.de</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Tool-Supported Empirical Film Studies</title>
      <p>
        The systematic empirical study of audio-visual patterns in feature films,
documentaries and TV reports requires a digitally supported methodology to produce
consistent, open and reusable data. The project relies on tool-based video
annotation and on data management that strictly follows the Linked Open Data principles.
A recent survey of available video annotation tools that can output RDF data
can be found in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Advene [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] was chosen as the annotation
software for this project because it best meets the needs of film scholars: it offers a timeline view and
segment-based annotations of video sequences on multiple tracks, which can
be used to annotate the various aspects under which a video is analyzed. We also
collaborate with the author of Advene to develop project-specific extensions.
      </p>
      <sec id="sec-2-1">
        <title>4 http://rml.io/index.html 5 http://ada.filmontology.org/</title>
        <p>Semantic Annotation and Automated Extraction of Audio-Visual Patterns
video analysis to reduce the need for elaborate manual annotations (analysis in
the background or triggered from the Advene software, see Sect. 4).
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Vocabulary for Fine-Grained Semantic Video</title>
    </sec>
    <sec id="sec-4">
      <title>Annotation</title>
      <p>
        Movies are annotated on a specific topic using eMAEX (Electronically-based Media Analysis of EXpressive movements, https://bit.ly/2K8i368),
an annotation method developed by and especially for film scholars. eMAEX enables a precise
description of the cinematographic image in its temporal development, which leads to
several hundred annotations per scene. One goal of the project is to make these
annotations accessible as Linked Open Data for the exchange and comparison of
analysis data. We have therefore developed an ontology data model that uses
the latest Web Annotation Vocabulary (https://www.w3.org/TR/annotation-vocab/) to express annotations and Media
Fragments URIs (https://www.w3.org/TR/media-frags/) for timecode-based referencing of video material. The film scholars
provide domain-specific knowledge for the development of a systematic
vocabulary of film-analytical concepts and terms (the AdA Ontology), since existing
ontologies such as LSCOM (http://vocab.linkeddata.es/lscom/), COMM (http://multimedia.semanticweb.org/COMM/), MediaOnt (https://www.w3.org/TR/mediaont-10/) and VidOnt [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] do not provide
the level of detail required to describe all the low-level features of audio-visual
content in the project according to scientific film-analytical standards.
      </p>
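      <p>As an illustration of this data model, the following minimal sketch expresses a single segment-level annotation with the Web Annotation Vocabulary and a Media Fragments URI using the Python rdflib library. The annotation IRI, the video URL and the annotation value are hypothetical placeholders and not terms taken from the AdA Ontology.</p>
      <preformat>
from rdflib import Graph, Namespace, URIRef, Literal, BNode
from rdflib.namespace import RDF

OA = Namespace("http://www.w3.org/ns/oa#")  # Web Annotation Vocabulary

g = Graph()
g.bind("oa", OA)

# Hypothetical annotation IRI and video URL; the fragment #t=12.5,17.0
# references the segment between seconds 12.5 and 17.0 (Media Fragments URI).
annotation = URIRef("http://ada.filmontology.org/annotation/example-001")
target = URIRef("http://example.org/media/report.mp4#t=12.5,17.0")
body = BNode()

g.add((annotation, RDF.type, OA.Annotation))
g.add((annotation, OA.hasTarget, target))
g.add((annotation, OA.hasBody, body))
g.add((body, RDF.value, Literal("pan")))  # e.g. a camera movement value

print(g.serialize(format="turtle"))
      </preformat>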
      <p>Our vocabulary provides nine categories for annotation, called Annotation
Levels: segmentation, language, image composition, camera, montage,
acoustics, bodily expressivity, motifs, and other optional aspects. Each level
contains several sub-aspects, referred to as Annotation Types. These types
correspond one-to-one to the tracks in the timeline view of the annotation
software. For example, the analysis of image composition includes the annotation of
brightness, contrast, color accents, and visual patterns, while annotations at the
camera level cover several aspects of camera movement, such as type of movement,
speed, and direction. Each of these types has its own properties, the so-called
Annotation Values. Where applicable, the ontology provides a set of predefined
annotation values for each annotation type. For example, the type of camera
movement can take pan, tilt, zoom, and other values. About 75% of the
annotation types provide predefined values; the others contain free-text annotations,
such as descriptions of scene settings or dialogue transcriptions.</p>
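      <p>Read purely as a data structure, the three-tier hierarchy of levels, types and values can be pictured as in the following sketch; the identifiers and value sets are illustrative only and do not reproduce the actual terms of the AdA Ontology.</p>
      <preformat>
# Illustrative excerpt of the three-tier vocabulary:
# Annotation Level -> Annotation Types -> predefined Annotation Values.
# None marks a free-text type without predefined values.
ada_vocabulary_sketch = {
    "image_composition": {
        "brightness": ["dark", "medium", "bright"],   # hypothetical value set
        "visual_pattern": None,                       # free-text annotation
    },
    "camera": {
        "camera_movement_type": ["pan", "tilt", "zoom"],              # plus further values
        "camera_movement_direction": ["left", "right", "up", "down"], # hypothetical value set
    },
    "language": {
        "dialogue_transcription": None,               # free-text annotation
    },
}
      </preformat>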
      <p>The current version of the AdA Ontology includes 9 annotation levels, 79
annotation types and 434 predefined annotation values. An online
version of the ontology is provided at http://ada.filmontology.org/ and a download
is available on the project's GitHub page (https://github.com/ProjectAdA/public).</p>
    </sec>
    <sec id="sec-5">
      <title>Classification of Audio-Visual Patterns</title>
      <p>
        In order to speed up the annotation process, we apply computer vision and
machine learning techniques to generate annotation data for unseen material.
As a first step, all videos in the corpus are automatically segmented into shots
based on video cuts (hard cuts, fades and dissolves [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Each shot is
represented by one or more key-frames, depending on the shot length. The key-frames
are used for further analysis of the visual video content. Colorspace analysis helps
to identify important aspects of the image composition, such as the
dominant color palette, diverging salient color accents and the overall brightness
of a shot. Optical flow analysis can be used to classify camera movement into
static, zoom, tilt and pan. Other important aspects, such as the depicted content, are
harder to grasp. Visual concepts shown in a video segment, such as landscape,
person or skyscraper, have been successfully classified using deep convolutional
neural networks trained on large amounts of manually labeled image data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since
the amount of training data in this project is limited, we use transfer learning
approaches to fine-tune a pretrained neural network on our target domain [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
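      <p>As an illustration of the transfer learning step, the following sketch fine-tunes only the classification head of an ImageNet-pretrained network on labeled key-frames; the backbone choice, the number of concept classes and the data loader are assumptions made for the example rather than the project's actual configuration.</p>
      <preformat>
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone and freeze its weights.
model = models.resnet50(weights="DEFAULT")
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one for the target concepts
# (e.g. landscape, person, skyscraper); the class count is hypothetical.
num_concepts = 20
model.fc = nn.Linear(model.fc.in_features, num_concepts)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)

def finetune_one_epoch(keyframe_loader):
    # keyframe_loader: hypothetical DataLoader yielding batches of
    # (key-frame images, concept labels) from the manual annotations.
    model.train()
    for images, labels in keyframe_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
      </preformat>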
      <p>All automatically derived annotations will be published as RDF based on
the Media Fragments standard. In order to distinguish them from manually
created annotations, provenance information and confidence scores are added.</p>
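      <p>A minimal sketch of how such provenance and confidence information could be attached in RDF is given below, using PROV-O via rdflib; the annotation IRI, the software agent IRI and the confidence property are hypothetical placeholders, as the project's exact modelling is not detailed here.</p>
      <preformat>
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import PROV, XSD

ADA = Namespace("http://ada.filmontology.org/ontology/")  # hypothetical namespace layout

g = Graph()
g.bind("prov", PROV)

# Hypothetical IRIs for an automatically generated annotation and the software agent.
annotation = URIRef("http://ada.filmontology.org/annotation/auto-0042")
shot_detector = URIRef("http://ada.filmontology.org/agent/shot-detection-v1")

g.add((annotation, PROV.wasAttributedTo, shot_detector))
g.add((annotation, ADA.confidence, Literal(0.87, datatype=XSD.float)))  # hypothetical property

print(g.serialize(format="turtle"))
      </preformat>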
      <p>Acknowledgments. This work is partially supported by the Federal Ministry
of Education and Research under grant number 01UG1632B.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Aubert</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prié</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Advene: an open-source framework for integrating and visualising audiovisual metadata</article-title>
          .
          <source>In: Proceedings of the 15th ACM international conference on Multimedia</source>
          . pp.
          <fpage>1005</fpage>
          -
          <lpage>1008</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hentschel</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hercher</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knuth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>Open Up Cultural Heritage in Video Archives with Mediaglobe</article-title>
          .
          <source>In: Proceedings of the 12th International Conference on Innovative Internet Community Systems (I2CS</source>
          <year>2012</year>
          ). vol.
          <volume>204</volume>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          .
          <source>Advances In Neural Information Processing Systems</source>
          pp.
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Scherer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greifenstein</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kappelhoff</surname>
          </string-name>
          , H.:
          <article-title>Expressive movements in audiovisual media: Modulating affective experience</article-title>
          . In: Muller,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Cienki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Fricke</surname>
          </string-name>
          ,
          <string-name>
            <surname>E.</surname>
          </string-name>
          , et al. (eds.) Body {
          <article-title>Language { Communication. An international handbook on multimodality in human interaction</article-title>
          , pp.
          <year>2081</year>
          {
          <year>2092</year>
          . De Gruyter Mouton, Berlin, New York (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sikos</surname>
            ,
            <given-names>L.F.</given-names>
          </string-name>
          :
          <article-title>RDF-powered semantic video annotation tools with concept mapping to Linked Data for next-generation video indexing: a comprehensive review</article-title>
          .
          <source>Multimedia Tools and Applications</source>
          <volume>76</volume>
          (
          <issue>12</issue>
          ),
          <fpage>14437</fpage>
          -
          <lpage>14460</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sikos</surname>
            ,
            <given-names>L.F.</given-names>
          </string-name>
          :
          <article-title>VidOnt: a core reference ontology for reasoning over video scenes</article-title>
          .
          <source>Journal of Information and Telecommunication</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <fpage>192</fpage>
          -
          <lpage>204</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yosinski</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clune</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lipson</surname>
          </string-name>
          , H.:
          <article-title>How transferable are features in deep neural networks?</article-title>
          <source>Advances in Neural Information Processing Systems</source>
          pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>