Indexing Camera Motion Integrating Knowledge of the Quality of the Encoded Video
P. Krämer, J. Benois-Pineau, member IEEE, M. Gràcia Pla


P. Krämer and J. Benois-Pineau are with LABRI UMR CNRS/University of Bordeaux
1/Enseirb/INRIA laboratory, 351, crs de la Libération, 33405 Talence Cedex,
France; petra.kraemer, jenny.benois@labri.fr; phone 33 5 40 00 84 24, fax
33 5 40 00 66 69. M. Gràcia Pla held a master's position at LABRI, on leave
from UPC, Barcelona, Spain.

   Abstract: Fast indexing of video content in the compressed domain has become
an important task as growing quantities of multimedia (MM) digital content are
available in this form. In this paper we present a method for fast indexing of
camera motion in MPEG1 and MPEG2 compressed video. We use P-frame motion
vectors and extract from the compressed stream some knowledge on the quality of
the compensated motion. This knowledge is then used to decide whether the
motion has to be refined. Camera motion is then indexed in terms of physical
motions. On a subset of the TREC Video 2005 test data, the correction raises
mean recall from 78.7% to 86.1% and mean precision from 74.5% to 76%.

   Index Terms: video indexing, camera motion, compressed streams.

I. INTRODUCTION
   Indexing and annotating large quantities of film and video material has
become a growing problem for the media industry. Today, indexing for large
application areas such as broadcast, archives, and home MM devices follows the
MPEG7-compliant way. MPEG7 is a standard [1] for describing multimedia content.
For visual media, it defines descriptors that characterize the content on a
visual basis. For video, whose intrinsic property is motion, it proposes motion
descriptors. Nevertheless, MPEG7 does not give hints on how to produce a
standard-compliant description of, e.g., camera motion, nor on how to translate
this description into features easily interpreted by humans, such as tilt,
zoom, or pan. A lot of multimedia content is already available in compressed
form. Furthermore, digitization of existing video content and digital
production of new content are today unthinkable without compression. Thus a lot
of work [2-4] has been devoted to the estimation of the camera model from the
motion vectors contained in the compressed stream. This work is another step
forward in the general framework that we call the “Rough Indexing Paradigm” and
that has been developed since [5]. Many indexing tasks, such as shot boundary
detection, scene grouping, video summarization, video object extraction, or
motion characterization, can be fulfilled on the degraded,
low-resolution/low-level data produced by encoding video streams with current
encoders (MPEG1, MPEG2, H.264, ...). We claim that a compressed stream is a
rich source of input data for indexing and that using it intelligently is only
a matter of interpretation. In this paper we show how to use not only MPEG
(1 or 2) motion vectors, but also the information on the quality of their
estimation, in order to estimate the camera model (Section II) and to qualify
motion in a humanly interpretable way (Section III). This is, for instance, the
task of camera motion characterization in TREC Video 2005, in which we
participated. We show how this knowledge helps to improve the indexing results
and give the perspectives of this work (Section IV).

II. GLOBAL MOTION ESTIMATION AND CORRECTION FROM MPEG COMPRESSED VIDEO
   In this section we address the problem of estimating the global motion
(camera model) in a video sequence. We use the motion compensation vectors of
P-frames. In order to keep the same temporal resolution and obtain a smooth
motion trajectory, we interpolate the model for I-frames. Finally, since MPEG
motion vectors are computed for optimal encoding and not for analysis purposes,
they can be strongly erroneous (e.g. in the case of strong motion); we
therefore propose how to detect such encoder failures and how to correct the
motion.

II.1 Global motion estimation from P-frames
   Here we rely on our previous work [5] and use a 6-parameter affine camera
model. As in [5], we suppose that an MPEG macro-block displacement vector is
expressed as:

\[
\begin{pmatrix} dx \\ dy \end{pmatrix}
= \begin{pmatrix} a_1 \\ a_4 \end{pmatrix}
+ \begin{pmatrix} a_2 & a_3 \\ a_5 & a_6 \end{pmatrix}
  \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix}
\tag{1}
\]

where $a_1, \ldots, a_6$ are the global motion parameters of the camera and
$(x_g, y_g)^T$ denotes the image center. Estimation with the robust estimator
that we proposed in [5] allows classifying macro-blocks (MBs) either as
conformant to the model, which we call the “dominant estimation support”, or as
outliers. The latter comprise intra-coded MBs and MBs in moving objects and in
occluding areas. This approach supposes that the current P-frame contains
motion vectors that express the apparent camera motion. Unfortunately, this is
not always the case. In order to recover the real camera motion in such frames,
it is necessary to detect encoder failures and to correct the motion.
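
The fit of the affine model (1) and the split of the macro-blocks into a dominant estimation support and outliers can be illustrated with a short Python sketch. It is not the robust estimator of [5], whose details are not given here: as a stand-in it uses a plain least-squares fit that is repeated after discarding blocks whose residual exceeds a threshold. The names `fit_affine`, `robust_affine`, the 1-pixel `threshold`, and the synthetic 5x3 grid of macro-blocks are assumptions made for this illustration only.

```python
from typing import List, Sequence, Tuple

Block = Tuple[float, float, float, float]  # (x, y, dx, dy) of one macro-block


def solve3(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a non-singular 3x3 system by Gaussian elimination with pivoting."""
    m = [list(a[i]) + [b[i]] for i in range(3)]
    for c in range(3):
        pivot = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for k in range(c, 4):
                m[r][k] -= f * m[c][k]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        x[r] = (m[r][3] - sum(m[r][k] * x[k] for k in range(r + 1, 3))) / m[r][r]
    return x


def fit_affine(blocks: Sequence[Block], weights: Sequence[float],
               center: Tuple[float, float]) -> List[float]:
    """Weighted least-squares estimate of [a1, ..., a6] in model (1)."""
    a = [[0.0] * 3 for _ in range(3)]
    bx, by = [0.0] * 3, [0.0] * 3
    for (x, y, dx, dy), w in zip(blocks, weights):
        row = (1.0, x - center[0], y - center[1])
        for i in range(3):
            for j in range(3):
                a[i][j] += w * row[i] * row[j]
            bx[i] += w * row[i] * dx
            by[i] += w * row[i] * dy
    # dx = a1 + a2*X + a3*Y and dy = a4 + a5*X + a6*Y share the same matrix.
    return solve3(a, bx) + solve3(a, by)


def residual(theta: Sequence[float], block: Block,
             center: Tuple[float, float]) -> float:
    """Distance between a block's vector and the vector predicted by (1)."""
    a1, a2, a3, a4, a5, a6 = theta
    x, y, dx, dy = block
    cx, cy = x - center[0], y - center[1]
    return ((dx - (a1 + a2 * cx + a3 * cy)) ** 2
            + (dy - (a4 + a5 * cx + a6 * cy)) ** 2) ** 0.5


def robust_affine(blocks: Sequence[Block], center: Tuple[float, float],
                  threshold: float = 1.0, max_iter: int = 20):
    """Refit until the set of conformant blocks (the support) is stable."""
    weights = [1.0] * len(blocks)
    theta = fit_affine(blocks, weights, center)
    for _ in range(max_iter):
        new_weights = [1.0 if residual(theta, b, center) <= threshold else 0.0
                       for b in blocks]
        if new_weights == weights or sum(new_weights) < 3:
            break
        weights = new_weights
        theta = fit_affine(blocks, weights, center)
    support = [i for i, w in enumerate(weights) if w > 0.0]
    return theta, support


# Synthetic 80x48 frame: pan 2, tilt -1, slight zoom; block 7 is on a moving object.
center = (40.0, 24.0)
blocks: List[Block] = []
for y in (8.0, 24.0, 40.0):
    for x in (8.0, 24.0, 40.0, 56.0, 72.0):
        blocks.append((x, y, 2.0 + 0.01 * (x - center[0]),
                       -1.0 + 0.01 * (y - center[1])))
blocks[7] = (40.0, 24.0, 8.0, 5.0)
theta, support = robust_affine(blocks, center)
```

In this synthetic case the first fit is biased by the moving block, the second fit excludes it, and `support` then plays the role of the dominant estimation support. Intra-coded MBs carry no motion vector and would simply be left out of `blocks`.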
II.2 Detection of frames with low-quality motion and motion correction
   If the motion estimator of the MPEG encoder fails, the motion compensation
error encoded in the MPEG stream is strong. Such failures depend heavily on the
parameter settings of the encoder and are observed in particular in the case of
strong motion (e.g. soccer content).
   We compute the mean low-frequency energy $E_t$ on the dominant estimation
support $D_t$, i.e. excluding the motion outliers:

\[
E_t = \frac{1}{|D_t|} \sum_{p \in D_t} \left( DC_P^{err}(p, t) \right)^2
\tag{2}
\]

Here $DC_P^{err}(p, t)$ are the DC coefficients extracted from the encoded
error in P-frames.
   To decide whether the motion model has to be corrected, we use the temporal
mean $\gamma_t$ of (2). If the instantaneous value of (2) exceeds
$\alpha \gamma_t$, with $\alpha \geq 1$, the motion is corrected.
   To carry out this correction, we first interpolate the motion model from
neighboring P-frames by linear regression. This interpolation is used as the
initialization of the model estimate in a gradient descent scheme.
   Here we minimize the mean square error of the motion compensation at DC
resolution on the dominant estimation support:

\[
MSE_t = \frac{1}{|D_t|} \sum_{p \in D_t}
        \left( I_t(p) - I_{t-1}(p + \vec{d}) \right)^2
\tag{3}
\]

   The optimization is done in the parameter space by gradient descent:

\[
\Theta_t^{i+1} = \Theta_t^{i} - \frac{\varepsilon}{2 |D_t|} G^{i}
\]

with $G$ as the gradient of (3) and $\varepsilon$ as the adaptive gain matrix.

III. CAMERA MOTION INDEXING
   The objective here is to translate the motion model (1) into physical motion
that is interpretable by humans, such as pan, tilt, or zoom. To do this we
follow [6] and reformulate the model (1) as:

\[
\begin{pmatrix} dx \\ dy \end{pmatrix}
= \begin{pmatrix} \mathrm{pan} \\ \mathrm{tilt} \end{pmatrix}
+ \begin{pmatrix}
    \mathrm{zoom} \cdot x - \mathrm{rot} \cdot y
      + \mathrm{hyp1} \cdot x + \mathrm{hyp2} \cdot y \\
    \mathrm{zoom} \cdot y + \mathrm{rot} \cdot x
      - \mathrm{hyp1} \cdot y + \mathrm{hyp2} \cdot x
  \end{pmatrix}
\tag{4}
\]

   Then two statistical hypotheses are tested on each parameter of this model.
The first one, $H_0$, supposes that the parameter is significant; the second
one, $H_1$, assumes that the parameter is not significant, i.e. equals zero.
   The likelihood function $f$ for each hypothesis is defined with respect to
the residuals between the estimated model and the MPEG motion vectors. These
residuals are supposed to follow a bivariate Gaussian law. The decision on
significance is made by comparing the log-likelihood ratio with a threshold. We
used this scheme in our previous work. However, when (2) indicates a bad
estimation, we do not compute residuals between the erroneous MPEG motion
vectors and those obtained by the re-estimated model. In this case the
interpolated parameters are used as reference (light correction).

IV. RESULTS AND CONCLUSION
   To assess the improvement due to the proposed integration of the knowledge
on erroneous motion and the re-estimation of motion (3), we conducted
experiments on the evaluation set of the TREC Video camera motion task
(http://www-nlpir.nist.gov/projects/trecvid/), in which we participated in
2005. A subset of 4 videos containing visually observable motion was chosen.
Using $\alpha = 4.0$ in the decision rule, about 4% of the P-frame motion is
corrected. With this correction we obtain a mean precision of 76% and a mean
recall of 86.1%. Without the correction, 74.5% and 78.7% are obtained
respectively. We stress that the gain in recall of about 7.4 percentage points
is already significant for this task.
   Hence in this paper we proposed a new method for motion correction when
estimating and indexing camera motion from compressed (MPEG1 and MPEG2) video
streams. We tested it for indexing purposes on the MPEG1 compressed TREC Video
test set. For video summarization by mosaicing from compressed streams and for
other indexing applications (shot boundary detection, object extraction) we
work on MPEG2 compressed streams as well. There is no difference in principle,
and the method appears promising for the whole Rough Indexing Paradigm, which
we continue to develop on compressed streams.

REFERENCES
[1]   MPEG-7 Requirements Document V.7: Coding of Moving Pictures and Audio.
[2]   E. Saez et al., “Global motion estimation algorithm for video
      segmentation”, Proc. SPIE, VCIP'03, pp. 1540-1550.
[3]   R. Ewerth et al., “Estimation of arbitrary camera motion in MPEG
      videos”, Proc. ICPR'04, pp. 512-515.
[4]   C. Doulaverakis et al., “Adaptive Methods for Motion Characterization
      and Segmentation of MPEG Compressed Frame Sequences”, Proc. ICIAR'04,
      pp. 310-317.
[5]   M. Durik et al., “Robust Motion Characterisation for Video Indexing
      based on Optical Flow”, Proc. CBMI'01, pp. 57-64.
[6]   P. Bouthemy et al., “A unified approach to shot change detection and
      camera motion characterization”, IEEE Trans. on CSVT, 9(7), pp.
      1030-1044.
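
As a supplementary sketch of the reformulation (4) in Section III, the physical parameters can be obtained from the affine parameters of (1) by identifying the coefficients of x and y in both models. The sketch assumes that the last term of the second row of (4) is hyp2 · x, as in the formulation of [6], and that x and y are measured relative to the image center. The function names are hypothetical, and the significance tests on the parameters are not covered.

```python
from typing import Dict, List, Sequence


def physical_from_affine(theta: Sequence[float]) -> Dict[str, float]:
    """Map [a1, ..., a6] of model (1) to the physical parameters of (4).

    Identification of coefficients (assumes hyp2 multiplies x in the dy row):
        a2 = zoom + hyp1,  a3 = hyp2 - rot,
        a5 = rot + hyp2,   a6 = zoom - hyp1.
    """
    a1, a2, a3, a4, a5, a6 = theta
    return {
        "pan": a1,
        "tilt": a4,
        "zoom": 0.5 * (a2 + a6),
        "rot": 0.5 * (a5 - a3),
        "hyp1": 0.5 * (a2 - a6),
        "hyp2": 0.5 * (a3 + a5),
    }


def affine_from_physical(m: Dict[str, float]) -> List[float]:
    """Inverse mapping, useful to check the identification above."""
    return [
        m["pan"], m["zoom"] + m["hyp1"], m["hyp2"] - m["rot"],
        m["tilt"], m["rot"] + m["hyp2"], m["zoom"] - m["hyp1"],
    ]


# Example: pan and tilt combined with some zoom, rotation and one shear term.
theta_example = [2.0, 0.03, -0.02, -1.0, 0.02, 0.01]
motion = physical_from_affine(theta_example)
round_trip = affine_from_physical(motion)
```

The mapping is linear and invertible, so no information is lost when moving from (1) to (4). Whether a given component is reported as present is left to the hypothesis tests described in Section III.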