<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Indexing Camera Motion Integrating Knowledge of the Quality of the Encoded Video</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>P. Krämer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. Benois-Pineau</string-name>
          <role>Member, IEEE</role>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Gràcia Pla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>P. Krämer and J. Benois-Pineau are with LABRI UMR CNRS/University of Bordeaux 1/Enseirb/INRIA laboratory</institution>
          ,
          <addr-line>351, crs de la Libération, 33405 Talence Cedex, France; petra.kraemer</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fast indexing of video content in the compressed domain has become an important task as growing quantities of multimedia (MM) digital content are available in this form. In this paper we present a method for fast indexing of camera motion in MPEG1 and MPEG2 compressed video. We use P-frame motion vectors and extract knowledge about the quality of the compensated motion from the compressed stream. This knowledge is then used to decide whether the motion has to be refined. Camera motion is finally indexed in terms of physical motions. On a subset of the TREC Video 2005 camera motion evaluation set, the correction raises mean recall from 78.7% to 86.1% and mean precision from 74.5% to 76%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Index Terms</title>
      <p>Video indexing, camera motion, compressed streams.</p>
    </sec>
    <sec id="sec-2">
      <title>I. INTRODUCTION</title>
      <p>Indexing and annotating large quantities of film and video material has become a growing problem for the media industry. Today, indexing for large application areas such as broadcast, archives, and home MM devices follows the MPEG7-compliant way. MPEG7 is a standard [1] for describing multimedia content. For visual media, it defines descriptors that characterize the content on a visual basis. For video, whose intrinsic property is motion, it proposes motion descriptors. Nevertheless, MPEG7 gives no hints on how to produce a standard-compliant description of, e.g., camera motion, or on how to translate this description into features easily interpreted by humans, such as tilt, zoom, or pan.</p>
      <p>A lot of multimedia content is already available in compressed form. Furthermore, digitization of existing video content and digital production of new content are today unthinkable without compression. Thus a lot of work [2–4] has been devoted to the estimation of the camera model from the motion vectors contained in the compressed stream. This work is another step forward in the general framework that we call the “Rough Indexing Paradigm” and that has been developed since [5]. Many indexing tasks, such as shot boundary detection, scene grouping, video summarization, video object extraction, or motion characterization, can be fulfilled on the degraded, low-resolution, low-level data produced by encoding video streams with current encoders (MPEG1, MPEG2, H.264, …). We claim that a compressed stream is a rich source of input data for indexing, and that using it intelligently is only a matter of interpretation.</p>
      <p>In this paper we show how to use not only MPEG (1 or 2) motion vectors but also the information on the quality of their estimation in order to estimate the camera model (Section II) and to qualify motion in a humanly interpretable way (Section III). This is, for instance, the task of camera motion characterization in TREC Video 2005, in which we participated. We show how this knowledge helps to improve the indexing results and give the perspectives of this work (Section IV).</p>
      <p>II. GLOBAL MOTION ESTIMATION AND CORRECTION FROM MPEG COMPRESSED VIDEO</p>
      <p>In this section we address the problem of estimating the global motion (camera model) in a video sequence. We use the motion compensation vectors from P-frames. To maintain the same temporal resolution and obtain a smooth motion trajectory, we interpolate the motion for I-frames. Finally, since MPEG motion vectors are computed for optimal encoding rather than for analysis purposes, they can be highly erroneous (e.g. in case of strong motion). We therefore propose how to detect such encoder failures and how to correct the motion.</p>
      <p>II.1 Global motion estimation from P-frames</p>
      <p>Here we rely on our previous work [5] and use a six-parameter affine camera model. Following [5], we assume that the displacement vector of an MPEG macro-block is expressed as:</p>
      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} a_1 \\ a_4 \end{pmatrix} + \begin{pmatrix} a_2 & a_3 \\ a_5 & a_6 \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}]]></tex-math>
        </disp-formula>
where <italic>a</italic><sub>1</sub>, ..., <italic>a</italic><sub>6</sub> are the global motion parameters of the camera, (<italic>x</italic>, <italic>y</italic>)<sup>T</sup> is the position of the macro-block, and (<italic>x<sub>g</sub></italic>, <italic>y<sub>g</sub></italic>)<sup>T</sup> denotes the image center. The robust estimator that we proposed in [5] allows classifying macro-blocks (MBs) as either conformant to the model, which we call the “dominant estimation support”, or outliers. The latter contain intra-coded MBs, MBs in moving objects, and MBs in occluding areas. This approach supposes that the current P-frame contains motion vectors that express the apparent camera motion. Unfortunately, this is not always the case. To recover the real camera motion in such frames, it is necessary to detect encoder failures and to correct the motion.
      </p>
      <p>II.2 Detection of frames with low-quality motion and motion correction</p>
      <p>If the motion estimator of the MPEG encoder fails, the motion compensation error encoded in the MPEG stream is large. Such failures depend strongly on the parameter settings of the encoder and are observed in particular in the case of strong motion (e.g. soccer content).</p>
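      <p>We compute the mean low-frequency energy <italic>E<sub>t</sub></italic> on the dominant estimation support <italic>D<sub>t</sub></italic>, i.e. excluding the motion outliers:</p>
      <p>
        <disp-formula>
          <label>(2)</label>
          <tex-math><![CDATA[E_t = \frac{1}{|D_t|} \sum_{p \in D_t} DCP_{err}(p, t)^2]]></tex-math>
        </disp-formula>
Here <italic>DCP<sub>err</sub></italic>(<italic>p</italic>, <italic>t</italic>) are the DC coefficients extracted from the encoded error in P-frames.
      </p>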
      <p>
        To decide whether the motion model has to be corrected, we use the temporal mean <italic>γ<sub>t</sub></italic> of the energy (2). If the instantaneous value of (2) exceeds <italic>αγ<sub>t</sub></italic>, with <italic>α</italic> ≥ 1, the motion is corrected.
      </p>
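      <p>The decision rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the DC error coefficients and the dominant support are taken as ready-made arrays, and the temporal mean is assumed to be the cumulative mean of the energies up to the current frame, since the text does not specify the averaging window.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def mean_dc_energy(dc_err, support):
    """E_t as in (2): mean squared DC error coefficient over the dominant support."""
    dc_err = np.asarray(dc_err, dtype=float)
    support = np.asarray(support, dtype=bool)
    if not support.any():
        return 0.0  # assumption: an empty support yields zero energy
    return float(np.mean(dc_err[support] ** 2))


def frames_to_correct(energies, alpha=4.0):
    """Flag frames whose energy exceeds alpha times the temporal mean gamma_t."""
    energies = np.asarray(energies, dtype=float)
    # Assumption: gamma_t is the cumulative mean of E_0..E_t.
    gamma = np.cumsum(energies) / np.arange(1, energies.size + 1)
    return energies > alpha * gamma


# One frame with 2x2 macro-blocks; the bottom-right block is a motion outlier.
dc = [[1.0, 2.0], [3.0, 100.0]]
inlier = [[True, True], [True, False]]
print(mean_dc_energy(dc, inlier))  # mean of 1, 4 and 9

# Per-frame energies with one simulated encoder failure at index 4.
print(frames_to_correct([1.0, 1.0, 1.0, 1.0, 40.0, 1.0]))
```
      </preformat>
      <p>With a cumulative mean, the failing frame itself raises the threshold, so a different window (for example the mean over the preceding frames only) would flag more frames for the same value of α.</p>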
      <p>To perform this correction, we first interpolate the motion model from the neighboring P-frames by linear regression. This interpolation is used to initialize the model estimate in the gradient descent scheme.</p>
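      <p>As a minimal sketch of this interpolation step, one straight line can be fitted per motion parameter over the neighboring P-frames and evaluated at the frame to be corrected. The function name, the frame indices, and the parameter values below are hypothetical; the number of neighbors used is not stated in the text and is left to the caller.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def interpolate_model(times, params, t_query):
    """Fit one straight line per motion parameter and evaluate it at t_query."""
    times = np.asarray(times, dtype=float)
    params = np.asarray(params, dtype=float)  # shape (n_frames, 6)
    # With a 2-D second argument, np.polyfit fits every column independently.
    slope, intercept = np.polyfit(times, params, deg=1)
    return slope * t_query + intercept


# Hypothetical example: P-frames at t = 0, 3, 9, 12 surround the frame t = 6.
t = np.array([0.0, 3.0, 9.0, 12.0])
base = np.array([1.0, 0.0, 0.0, -2.0, 0.0, 0.0])
rate = np.array([0.5, 0.0, 0.01, 0.25, 0.0, 0.0])
theta = base + t[:, None] * rate  # parameters drifting linearly in time
print(interpolate_model(t, theta, 6.0))
```
      </preformat>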
      <p>Here we minimize the functional of the mean square error of the motion compensation at DC resolution on the dominant estimation support:</p>
      <p>
        <disp-formula>
          <label>(3)</label>
          <tex-math><![CDATA[MSE_t = \frac{1}{|D_t|} \sum_{p \in D_t} \left( I_t(p) - I_{t-1}(p + \vec{d}\,) \right)^2]]></tex-math>
        </disp-formula>
The optimization is done in the parameter space by gradient descent:
        <disp-formula>
          <tex-math><![CDATA[\Theta_t^{i+1} = \Theta_t^{i} - \frac{\varepsilon}{2\,|D_t|}\, G^{i}]]></tex-math>
        </disp-formula>
with <italic>G</italic> as the gradient of (3) and <italic>ε</italic> as the adaptive gain matrix.
      </p>
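      <p>The refinement can be illustrated with the following sketch, which is not the authors' implementation. It assumes an affine model with translation (a1, a4) and linear part [[a2, a3], [a5, a6]], bilinear sampling of the previous DC image, a finite-difference gradient in place of an analytic one, and a diagonal gain that is halved after a failed step and enlarged after a successful one as a simple stand-in for the adaptive gain matrix. The constant factor of the update is folded into the gain. All function names and the synthetic frames are hypothetical.</p>
      <preformat preformat-type="code">
```python
import numpy as np


def displacement(theta, x, y, center):
    """Affine model: d = (a1, a4) + [[a2, a3], [a5, a6]] (p - center)."""
    a1, a2, a3, a4, a5, a6 = theta
    u, v = x - center[0], y - center[1]
    return a1 + a2 * u + a3 * v, a4 + a5 * u + a6 * v


def bilinear(img, x, y):
    """Sample img at real-valued positions (x, y), clamped to the image."""
    h, w = img.shape
    x = np.clip(x, 0.0, w - 1.000001)
    y = np.clip(y, 0.0, h - 1.000001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bottom = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom


def mse(theta, cur, prev, pts, center):
    """Mean square compensation error over the support points (x, y columns)."""
    x, y = pts[:, 0], pts[:, 1]
    dx, dy = displacement(theta, x, y, center)
    resid = cur[y.astype(int), x.astype(int)] - bilinear(prev, x + dx, y + dy)
    return float(np.mean(resid ** 2))


def numeric_gradient(f, theta, h=1e-4):
    """Central finite differences (assumption: stands in for an analytic gradient)."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = h
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * h)
    return grad


def refine(theta0, cur, prev, pts, center, gain, n_iter=50):
    """Gradient descent that only accepts steps which lower the error."""
    def cost(th):
        return mse(th, cur, prev, pts, center)

    theta = np.array(theta0, dtype=float)
    gain = np.array(gain, dtype=float)
    best = cost(theta)
    for _ in range(n_iter):
        candidate = theta - gain * numeric_gradient(cost, theta)
        value = cost(candidate)
        if value >= best:
            gain = gain * 0.5      # failed step: shrink the gain
        else:
            theta, best = candidate, value
            gain = gain * 1.2      # successful step: enlarge the gain
    return theta, best


# Synthetic DC-resolution frames: the current frame is the previous one
# shifted by (0.6, -0.4), sampled with the same bilinear scheme.
yy, xx = np.mgrid[0:16, 0:16].astype(float)
prev = np.sin(0.4 * xx) + np.cos(0.3 * yy)
cur = bilinear(prev, xx + 0.6, yy - 0.4)
ys, xs = np.mgrid[2:14, 2:14]
pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)
center = (7.5, 7.5)

initial = mse(np.zeros(6), cur, prev, pts, center)
theta_hat, final = refine(np.zeros(6), cur, prev, pts, center,
                          gain=[1.0, 0.01, 0.01, 1.0, 0.01, 0.01])
print(initial, final, theta_hat)
```
      </preformat>
      <p>In the method described here the starting point would be the model interpolated from the neighboring P-frames rather than the zero vector used in this toy example, and the support points would be the macro-blocks of the dominant estimation support.</p>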
    </sec>
    <sec id="sec-3">
      <title>III. CAMERA MOTION INDEXING</title>
      <p>
        The objective here is to translate the motion model (1) into physical motion that is interpretable by humans, such as pan, tilt, or zoom. To do this we follow [6] and reformulate the model (1) as:
        <disp-formula>
          <label>(4)</label>
          <tex-math><![CDATA[\begin{pmatrix} d_x \\ d_y \end{pmatrix} = \begin{pmatrix} \mathit{pan} \\ \mathit{tilt} \end{pmatrix} + \begin{pmatrix} \mathit{zoom} \cdot x - \mathit{rot} \cdot y + \mathit{hyp}_1 \cdot x + \mathit{hyp}_2 \cdot y \\ \mathit{zoom} \cdot y + \mathit{rot} \cdot x - \mathit{hyp}_1 \cdot y + \mathit{hyp}_2 \cdot x \end{pmatrix}]]></tex-math>
        </disp-formula>
      </p>
      <p>Two statistical hypotheses are then tested on each parameter of this model. The first one, <italic>H</italic><sub>0</sub>, supposes that the parameter is significant; the second one, <italic>H</italic><sub>1</sub>, assumes that the parameter is not significant, i.e. equals zero.</p>
      <p>
        The likelihood function <italic>f</italic> for each hypothesis is defined with respect to the residuals between the estimated model and the MPEG motion vectors. These residuals are assumed to follow a bivariate Gaussian law. The decision on the significance is made by comparing the log-likelihood ratio with a threshold. We used this scheme in our previous work. However, when a bad estimation is known from (2), we do not compute residuals between the erroneous MPEG motion vectors and those obtained by the re-estimated model. In this case the interpolated parameters are used as reference (light correction).
      </p>
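      <p>A minimal sketch of such a significance test is given below. It is an illustration under assumptions that are not taken from the text: the model (4) is fitted to the motion vectors by ordinary least squares, the restricted model is refitted with one component forced to zero, and the residuals are treated as isotropic Gaussian, so that the log-likelihood ratio reduces to n·ln(RSS<sub>zero</sub>/RSS<sub>full</sub>) for n motion vectors. The function names, the threshold value, and the synthetic data are hypothetical.</p>
      <preformat preformat-type="code">
```python
import numpy as np

NAMES = ("pan", "tilt", "zoom", "rot", "hyp1", "hyp2")


def design_matrix(x, y):
    """Stack the dx rows and the dy rows of model (4) into one linear system."""
    one, zero = np.ones_like(x), np.zeros_like(x)
    rows_dx = np.stack([one, zero, x, -y, x, y], axis=1)
    rows_dy = np.stack([zero, one, y, x, -y, x], axis=1)
    return np.vstack([rows_dx, rows_dy])


def rss(A, b):
    """Residual sum of squares of the least-squares fit of b by A."""
    coef = np.linalg.lstsq(A, b, rcond=None)[0]
    r = b - A @ coef
    return float(r @ r)


def significant_components(x, y, dx, dy, threshold):
    """Compare the full model with each model that forces one component to zero."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    A = design_matrix(x, y)
    b = np.concatenate([dx, dy])
    n = x.size
    rss_full = rss(A, b)
    flags = {}
    for k, name in enumerate(NAMES):
        rss_zero = rss(np.delete(A, k, axis=1), b)
        # Assumption: isotropic Gaussian residuals, variance estimated per hypothesis.
        llr = n * np.log(rss_zero / rss_full)
        flags[name] = bool(llr > threshold)
    return flags


# Synthetic motion vectors on an 11x11 grid centred on the image centre:
# a pan of 2 and a zoom of 0.05, plus Gaussian noise on every vector.
rng = np.random.default_rng(0)
gy, gx = np.mgrid[-5:6, -5:6]
x, y = gx.ravel().astype(float), gy.ravel().astype(float)
dx = 2.0 + 0.05 * x + rng.normal(0.0, 0.1, x.size)
dy = 0.05 * y + rng.normal(0.0, 0.1, y.size)
flags = significant_components(x, y, dx, dy, threshold=20.0)
print(flags)
```
      </preformat>
      <p>On this synthetic input only the pan and zoom components should be reported as significant. For frames flagged by the decision rule on (2), the residuals would be computed against the interpolated parameters instead of the erroneous MPEG vectors, which this sketch does not cover.</p>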
    </sec>
    <sec id="sec-4">
      <title>IV. RESULTS AND CONCLUSION</title>
      <p>
        To assess the improvement due to the proposed integration of the knowledge on erroneous motion and the re-estimation of motion (3), we conducted experiments on the evaluation set of the TREC Video camera motion task (http://www-nlpir.nist.gov/projects/trecvid/), in which we participated in 2005. A subset of 4 videos containing visually observable motion was chosen. Using <italic>α</italic> = 4.0 in the decision rule, about 4% of the P-frame motion is corrected. With this correction we obtain a mean precision of 76% and a mean recall of 86.1%. Without the correction, 74.5% and 78.7% are obtained, respectively. We stress that the increase in recall of more than 7 percentage points is already substantial for this task.
      </p>
      <p>In this paper we proposed a new method for motion correction when estimating and indexing camera motion from compressed (MPEG1 and MPEG2) video streams.</p>
      <p>We tested it for indexing purposes on the MPEG1 compressed TREC Video test set. For video summarization by mosaicing from compressed streams and for other indexing applications (shot boundary detection, object extraction) we work on MPEG2 compressed streams as well. There is no difference in principle, and the method appears promising for the whole Rough Indexing Paradigm, which we continue to develop on compressed streams.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <article-title>MPEG-7 Requirements Document V.7: Coding of Moving Pictures and Audio</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Saez</surname>
          </string-name>
          et al., “
          <article-title>Global motion estimation algorithm for video segmentation</article-title>
          ”,
          <source>Proc. SPIE, VCIP'03</source>
          , pp.
          <fpage>1540</fpage>
          -
          <lpage>1550</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          et al., “
          <article-title>Estimation of arbitrary camera motion in MPEG videos</article-title>
          ”,
          <source>Proc. ICPR'04</source>
          , pp.
          <fpage>512</fpage>
          -
          <lpage>515</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Doulaverakis</surname>
          </string-name>
          et al., “
          <article-title>Adaptive Methods for Motion Characterization and Segmentation of MPEG Compressed Frame Sequences</article-title>
          ”,
          <source>Proc. ICIAR'04</source>
          , pp.
          <fpage>310</fpage>
          -
          <lpage>317</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Durik</surname>
          </string-name>
          et al., “
          <article-title>Robust Motion Characterisation for Video Indexing based on Optical Flow</article-title>
          ”,
          <source>Proc. CBMI'01</source>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>64</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bouthemy</surname>
          </string-name>
          et al., “
          <article-title>A unified approach to shot change detection and camera motion characterization</article-title>
          ”,
          <source>IEEE Trans. on CSVT</source>
          ,
          <volume>9</volume>
          (
          <issue>7</issue>
          ), pp.
          <fpage>1030</fpage>
          -
          <lpage>1044</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>