Indexing Camera Motion Integrating Knowledge of the Quality of the Encoded Video

P. Krämer, J. Benois-Pineau, member IEEE, M. Gràcia Pla

P. Krämer and J. Benois-Pineau are with LABRI UMR CNRS/University of Bordeaux 1/Enseirb/INRIA laboratory, 351, crs de la Libération, 33405 Talence Cedex, France; petra.kraemer, jenny.benois@labri.fr; phone 33 5 40 00 84 24, fax 33 5 40 00 66 69. M. Gràcia Pla held a master's position at LABRI while on leave from UPC, Barcelona, Spain.

Abstract—Fast indexing of video content in the compressed domain has become an important task as growing quantities of multimedia (MM) digital content are available in this form. In this paper we present a method for fast indexing of camera motion in MPEG1 and MPEG2 compressed video. We use P-frame motion vectors and extract knowledge on the quality of the compensated motion from the compressed stream. This knowledge is then used to decide whether the motion has to be refined. The camera motion is finally indexed in terms of physical motions. Results obtained on the TREC Video test data set show a gain in both precision and recall.

Index Terms—video indexing, camera motion, compressed streams.

I. INTRODUCTION

Indexing and annotating large quantities of films and video material has become a growing problem for the media industry. Today, indexing for large application areas such as broadcast, archives, and home MM devices is expected to be done in an MPEG7 compliant way. MPEG7 is a standard [1] for describing multimedia content. For visual media, it defines descriptors that characterize the content on a visual basis. For video, whose intrinsic property is motion, it proposes motion descriptors. Nevertheless, MPEG7 gives no hints on how to produce a standard-compliant description of, e.g., camera motion, nor on how to translate this description into features easily interpreted by humans, such as tilt, zoom, or pan.

A lot of multimedia content is already available in compressed form. Furthermore, the digitization of existing video content and the digital production of new content are today unthinkable without compression. Thus a lot of work [2-4] has been devoted to the estimation of the camera model from the motion vectors contained in the compressed stream. This work is another step forward in the general framework which we call the “Rough Indexing Paradigm” and which has been developed since [5]. Many indexing tasks, such as shot boundary detection, scene grouping, video summarization, video object extraction, or motion characterization, can be fulfilled on the degraded, low-resolution, low-level data produced by encoding video streams with current encoders (MPEG1, 2, H.264 …). We claim that a compressed stream is a rich source of input data for indexing and that using it intelligently is only a matter of interpretation.

In this paper we show how we can use not only the MPEG (1 or 2) motion vectors, but also the information on the quality of their estimation, in order to estimate the camera model (Section II) and to qualify the motion in a humanly interpretable way (Section III). This corresponds, for instance, to the camera motion characterization task of TREC Video 2005, in which we participated. We show how this knowledge helps to improve the indexing results and give the perspectives of this work (Section IV).

II. GLOBAL MOTION ESTIMATION AND CORRECTION FROM MPEG COMPRESSED VIDEO

In this section we address the problem of estimating the global motion (camera model) in a video sequence. We use the motion compensation vectors of P-frames. In order to keep the same temporal resolution and to get a smooth motion trajectory, we interpolate the model for I-frames. Finally, as MPEG motion vectors are not computed for analysis purposes but for optimal encoding, they can be strongly erroneous (e.g. in the case of strong motion). We therefore propose how to detect such encoder failures and how to correct the motion.

II.1 Global motion estimation from P-frames

Here we rely on our previous work [5] and use a 6-parameter affine camera model. We suppose [5] that the MPEG displacement vector $(dx, dy)^T$ of a macro-block at position $(x, y)^T$ is expressed as:

$$\begin{pmatrix} dx \\ dy \end{pmatrix} = \begin{pmatrix} a_1 \\ a_4 \end{pmatrix} + \begin{pmatrix} a_2 & a_3 \\ a_5 & a_6 \end{pmatrix} \begin{pmatrix} x - x_g \\ y - y_g \end{pmatrix} \tag{1}$$

where $a_1, \ldots, a_6$ are the global motion parameters of the camera and $(x_g, y_g)^T$ denotes the image center. The robust estimator that we proposed in [5] allows classifying macro-blocks (MBs) either as conformant to the model, which we call the “dominant estimation support”, or as outliers. The latter contain intra-coded MBs, MBs in moving objects, and MBs in occluding areas. This approach supposes that the current P-frame contains motion vectors which express the apparent camera motion. Unfortunately, this is not always the case. In order to recover the real camera motion in such frames, it is necessary to detect encoder failures and to correct the motion.

II.2 Detection of frames with low-quality motion and motion correction

If the motion estimator of the MPEG encoder failed, the motion compensation error encoded in the MPEG stream is strong. Such failures depend heavily on the parameter settings of the encoder and are specifically observed in the case of strong motion (e.g. soccer content).

We compute the mean low-frequency energy $E_t$ on the dominant estimation support $D_t$, i.e. excluding the outliers:

$$E_t = \frac{1}{|D_t|} \sum_{p \in D_t} \left( DC_P^{err}(p, t) \right)^2 \tag{2}$$

Here $DC_P^{err}(p, t)$ are the DC coefficients extracted from the encoded error in P-frames.

To decide whether the motion model has to be corrected, we use the temporal mean $\gamma_t$ of (2). If the instantaneous value of (2) exceeds $\alpha \gamma_t$, with $\alpha \geq 1$, then the motion is corrected.

To carry out this correction, we first interpolate the motion model from the neighboring P-frames by linear regression. This interpolation is used as the initialization of the model estimate in a gradient descent scheme. We minimize the mean square error of the motion compensation at DC resolution on the dominant estimation support:

$$MSE_t = \frac{1}{|D_t|} \sum_{p \in D_t} \left( I_t(p) - I_{t-1}(p + \vec{d}) \right)^2 \tag{3}$$

The optimization is done in the parameter space by gradient descent:

$$\Theta_i^{t+1} = \Theta_i^{t} - \frac{\varepsilon}{2 |D_t|} G_i$$

with $G$ the gradient of (3) and $\varepsilon$ the adaptive gain matrix.

III. CAMERA MOTION INDEXING

The objective here is to translate the motion model (1) into physical motion interpretable by humans, such as pan, tilt, or zoom. To do this we follow [6] and reformulate the model (1) as:

$$\begin{pmatrix} dx \\ dy \end{pmatrix} = \begin{pmatrix} \mathit{pan} \\ \mathit{tilt} \end{pmatrix} + \begin{pmatrix} \mathit{zoom} \cdot x - \mathit{rot} \cdot y + \mathit{hyp1} \cdot x + \mathit{hyp2} \cdot y \\ \mathit{zoom} \cdot y + \mathit{rot} \cdot x - \mathit{hyp1} \cdot y + \mathit{hyp2} \cdot x \end{pmatrix} \tag{4}$$

Then two statistical hypotheses are tested on each parameter of this model. The first one, $H_0$, supposes that the parameter is significant; the second one, $H_1$, assumes that the component is not significant, i.e. equals zero. The likelihood function $f$ for each hypothesis is defined with respect to the residuals between the estimated model and the MPEG motion vectors. These residuals are supposed to follow a bivariate Gaussian law. The decision on the significance is made by comparing the log-likelihood ratio with a threshold. We used this scheme in our previous work. However, when (2) indicates a bad estimation, we do not compute residuals between the erroneous MPEG motion vectors and those obtained by the re-estimated model. In this case the interpolated parameters are used as the reference (light motion correction).

IV. RESULTS AND CONCLUSION

To assess the improvement due to the proposed integration of the knowledge on erroneous motion and the re-estimation of motion (3), we conducted experiments on the evaluation set of the TREC Video camera motion task (http://www-nlpir.nist.gov/projects/trecvid/), in which we participated in 2005. A subset of 4 videos containing visually observable motion was chosen. Using α = 4.0 in the decision rule, the motion of about 4% of the P-frames is corrected. With this correction we obtain a mean precision of 76% and a mean recall of 86.1%. Without the correction, 74.5% and 78.7% are obtained respectively. We stress that the gain in recall of more than 7 percentage points is already very significant for this task.

In this paper we thus proposed a new method for motion correction when estimating and indexing camera motion from compressed (MPEG1 and MPEG2) video streams. We tested it for indexing purposes on the MPEG1 compressed TREC Video test set. For video summarization by mosaicing from compressed streams and for other indexing applications (shot boundary detection, object extraction), we work on MPEG2 compressed streams as well. There is no difference in principle, and the method appears promising for the whole Rough Indexing Paradigm, which we continue to develop on compressed streams.

REFERENCES

[1] MPEG-7 Requirements Document V.7: Coding of Moving Pictures and Audio.
[2] E. Saez et al., “Global motion estimation algorithm for video segmentation”, Proc. SPIE, VCIP'03, pp. 1540-1550.
[3] R. Ewerth et al., “Estimation of arbitrary camera motion in MPEG videos”, Proc. ICPR'04, pp. 512-515.
[4] C. Doulaverakis et al., “Adaptive Methods for Motion Characterization and Segmentation of MPEG Compressed Frame Sequences”, Proc. ICIAR'04, pp. 310-317.
[5] M. Durik et al., “Robust Motion Characterisation for Video Indexing based on Optical Flow”, Proc. CBMI'01, pp. 57-64.
[6] P. Bouthemy et al., “A unified approach to shot change detection and camera motion characterization”, IEEE Trans. on CSVT, 9(7), pp. 1030-1044.
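
As an illustration of the decision rule of Section II.2 and of the reformulation (4) of Section III, the following Python sketch may help. It is not the authors' implementation. The function names, the dictionary layout of the DC coefficients, and the choice of taking the temporal mean over all frames passed in are assumptions made for the example. The conversion to physical parameters assumes the usual form of (4), in which the hyp2 term of dy multiplies x, and identifies its coefficients with those of the affine model (1).

```python
from statistics import fmean


def mean_dc_energy(dc_err, support):
    """E_t of (2): mean squared DC coefficient of the encoded prediction
    error, taken over the dominant estimation support D_t only.

    dc_err  : dict mapping a macro-block position to its DC coefficient
              (hypothetical data layout)
    support : iterable of positions classified as conformant to the model
    """
    support = list(support)
    if not support:
        raise ValueError("empty dominant estimation support")
    return fmean(dc_err[p] ** 2 for p in support)


def frames_to_correct(energies, alpha=4.0):
    """Indices t whose energy E_t exceeds alpha * gamma.

    Assumption: gamma is the mean of E_t over all frames passed in;
    the paper only states that a temporal mean of (2) is used.
    """
    if alpha < 1:
        raise ValueError("alpha must be >= 1")
    gamma = fmean(energies)
    return [t for t, e in enumerate(energies) if e > alpha * gamma]


def affine_to_physical(a1, a2, a3, a4, a5, a6):
    """Map the affine parameters of (1) to the physical parameters of (4).

    Matching coefficients gives a2 = zoom + hyp1, a3 = hyp2 - rot,
    a5 = rot + hyp2 and a6 = zoom - hyp1.
    """
    return {
        "pan": a1,
        "tilt": a4,
        "zoom": (a2 + a6) / 2,
        "rot": (a5 - a3) / 2,
        "hyp1": (a2 - a6) / 2,
        "hyp2": (a3 + a5) / 2,
    }
```

In this sketch a frame flagged by `frames_to_correct` would then have its model re-initialized from the neighboring P-frames and refined by minimizing (3); that refinement step is not shown.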