=Paper=
{{Paper
|id=Vol-233/paper-5
|storemode=property
|title=Cepstral Polynomial Regression For Sequential Detection of Impulsive Waveform in Video Sound-Track
|pdfUrl=https://ceur-ws.org/Vol-233/p11.pdf
|volume=Vol-233
|dblpUrl=https://dblp.org/rec/conf/samt/HoryCK06
}}
==Cepstral Polynomial Regression For Sequential Detection of Impulsive Waveform in Video Sound-Track==
Cyril Hory, William J. Christmas, Anil Kokaram
Cyril Hory is with Laboratoire Traitement et Communication de l’Information, CNRS-GET/Télécom-Paris, 37-39 rue Dareau, 75014 Paris.

William J. Christmas is with University of Surrey, Centre for Vision Speech and Signal Processing, Guildford GU2 7XH, UK.

Anil Kokaram is with University of Dublin, Trinity College, EEE Department, College Green, Dublin 2, Ireland.

This work was carried out during the tenure of a MUSCLE Internal fellowship.

Abstract: A new set of features is introduced for the characterization of impulsive events in video clips from the audio signal. The discriminative power of these features to detect and isolate racket hits in tennis video clips is discussed.

Index Terms: cepstral features, sequential classification

===I. Introduction===
In digital video analysis it is now apparent that audio cues extracted from a video, along with the visual cues, can provide relevant information for semantic understanding of the content. In [5], for example, audio and visual features are combined within an HMM framework for parsing tennis video. Audio-visual cooperation is ensured through multi-modal conditional density estimation in [1]. Mel-cepstral coefficients are used in [4] to identify specific sounds in a baseball game video clip in order to detect commercials, speech or music using the maximum entropy method.

In many application domains it is possible to identify some critically informative elementary short-term events. However, even though visual attributes can be as informative as audio attributes for characterizing short-term events, audio data are more convenient to handle in terms of computational load. Among short-term events, impulsive waveforms can be especially informative. For instance, accurate percussive sound detection can help with beat analysis. Racket hits are particularly informative events for the understanding of a tennis game. From the detection and characterisation of racket hits, it is possible to extract information such as the score, player fitness and skills, or the strategy.

We have proposed in [3] a semi-supervised sequential scheme for detecting events from the audio stream of a video sequence using the Generalized CUSUM procedure. Following this detection step, a system for event identification can be triggered when an event is detected.

===II. Cepstral Features Extraction===
We focus here on the identification of impulsive waveforms from the audio content after detection. We propose a new set of features based on cepstral analysis of the recorded events. Denote by <math>\mathbf{c} = [c_1, c_2, \ldots, c_N]^T</math> the vector of cepstral coefficients of the event <math>e</math> [2]:

:<math>\mathbf{c} = \left| \mathrm{FT}^{-1}\{ \log(|\mathrm{FT}\{e\}|) \} \right| ,</math> (1)

where <math>\mathrm{FT}\{\cdot\}</math> is the discrete Fourier transform. Assume there exists a vector <math>\mathbf{a}_{(p)} = [a_0, a_1, \ldots, a_p]^T</math> such that

:<math>\mathbf{c} = \mathbf{Q}_{(p)} \mathbf{a}_{(p)} + \boldsymbol{\nu}_{(p)} ,</math> (2)

where <math>\mathbf{Q}_{(p)}</math> is an <math>N \times (p+1)</math> matrix with element <math>g_{j-1}(q_{i-1})</math> on the ith row and jth column, where <math>g_j</math> can be any conveniently chosen polynomial of order <math>j</math> and <math>q_i</math> is the ith quefrency index, and <math>\boldsymbol{\nu}_{(p)}</math> is an <math>N \times 1</math> vector of random perturbations. The mean-square error estimate <math>\hat{\mathbf{a}}_{(p)}</math> of the vector of regression coefficients <math>\mathbf{a}_{(p)}</math> is:

:<math>\hat{\mathbf{a}}_{(p)} = \mathbf{R}_{(p)}^{-1} \mathbf{Q}_{(p)}^{T} \mathbf{c} ,</math> (3)

with <math>\mathbf{R}_{(p)} = \mathbf{Q}_{(p)}^{T} \mathbf{Q}_{(p)}</math>. The Cepstral Regression Coefficients (CRC) <math>\hat{\mathbf{a}}_{(p)}</math> are descriptors of the cepstrum content of the detected events.

In unsupervised sequential classification and learning, the amount of available data is often small. Low-dimensional feature spaces must be considered in order to cope with the curse of dimensionality. In such a situation the CRCs allow for the encoding of all the information carried by the cepstrum in a small number of coefficients. Moreover, the inversion of the <math>(p+1) \times (p+1)</math> matrix <math>\mathbf{R}_{(p)}</math> in (3) can be performed recursively by using a block matrix decomposition and the Schur complement of <math>\mathbf{R}_{(p-1)}</math>. The dimensionality of the feature space can thus be adaptively updated to match the size of the dataset. The recursive computation of the regression coefficients makes these features particularly appealing for the implementation of a sequential classification system.

===III. Experimental Validation===
A classification experiment was carried out on excerpts of a tennis video clip to evaluate the capability of the proposed features to discriminate impulsive waveforms. The impulsive waveforms (target class of the classification experiment) are the racket hits. A Biased Discriminant Analysis [6] was performed within a supervised learning framework. The experiment was carried out on the second and third games of the final tennis match of the 2003 Australian Open. The training set contains 203 events that were detected by the CUSUM test, including 20 actual racket hits. The test data contains 479 events that were detected
by the CUSUM test, including 61 racket hits.

Fig. 1 displays the ROC curves of the classifier based on the First Cepstral Coefficients (FCC), the CRCs, and the Spectral Regression Coefficients (SRC) of the extracted events. The 3-dimension CRC vector outperforms the FCCs whether the training set or the test set is classified. The 3 FCCs fail to encode enough information about the event.

Fig. 1. ROC curves (detection probability versus false alarm probability) of the Biased Discriminant Analysis for order 2 (dash-dotted line) and order 6 (plain line) of the (a) first cepstral coefficients (liftering), (b) cepstral polynomial regression coefficients, and (c) spectral polynomial regression coefficients. The grey lines correspond to the analysis of the training data. The black lines correspond to the analysis of the test data.

The test set classification performances of the CRCs and SRCs are equivalent, although the CRCs outperform the SRCs when applied to the training set.

The 7-dimension FCCs perform better than the 3-dimension FCCs when applied to the training set, although the performances on the test set are similar whatever the feature space dimension. When applied to the training set, the 7-dimension FCC vector behaves as the CRCs and outperforms the SRCs. However, the performance of the FCCs deteriorates dramatically when applied to the test set. This shows that a classifier based on the FCCs suffers from over-fitting.

If the waveform is modelled as the convolution of a source waveform and a filter impulse response, the FCCs encode information about the filter while high-quefrency coefficients are characteristic of the source [2]. In the experiment, the impulse response of the filter depends on the acoustic characteristics of the hall and on the electronic recording device. It is common to all the extracted events. Thus the filter characteristics cannot provide relevant information for discriminating the various events.

The CRC of order 6 exhibits a better ability to characterise and discriminate the racket hits than the CRC of order 2, except at high false alarm probability. This shows that the CRC of order 6 tends to over-fit the training set. As a consequence, outlier racket hits are more often taken into account in the model. In this case, the outliers are two lifted shots. The high probability of detection obtained, even though the characteristics of the class have evolved from the second to the third game, shows that the CRC features are relevant to encode the cepstrum content of a racket hit in a non-stationary context.

===IV. Conclusion===
The CRCs perform better than the standard FCCs in a low-dimension feature space, but the discrepancy between the performances of the two feature vectors decreases when the dimension increases. One can conclude that the CRCs seem more appropriate for classification of impulsive waveforms when dealing with a small data set. In a higher-dimension feature space the performances are equivalent, but feature vectors computed from the polynomial regression (CRCs and SRCs) tend to over-fit less than the standard FCCs. The high discriminating power in a low-dimensional feature space and the recursive computation of the features allow for their integration in a sequential and adaptive learning system. Work is currently being carried out to show how the proposed features could improve the retrieval results obtained here in a static supervised learning context.

===References===
[1] R. Dahyot, A. Kokaram, A. Rea, and H. Denman, “Joint audio visual retrieval for tennis broadcasts,” in Proceedings of IEEE ICASSP’03, 2003, pp. 561–564.

[2] J. R. Deller, J. G. Proakis, and J. H. L. Hansen, Discrete-Time Processing of Speech Signals. MacMillan, 1993.

[3] C. Hory, A. Kokaram, and W. J. Christmas, “Threshold learning from samples drawn from the null hypothesis for the GLR CUSUM test,” in Proc. IEEE MLSP, 2005, pp. 111–116.

[4] W. Hua, M. Han, and Y. Gong, “Baseball scene classification using multimedia features,” in Proceedings of IEEE ICME’02, 2002, pp. 821–824.

[5] E. Kijak, G. Gravier, P. Gros, L. Oisel, and F. Bimbot, “HMM based structuring of tennis videos using visual and audio cues,” in Proc. IEEE ICME’03, 2003, pp. 309–312.

[6] X. S. Zhou and T. S. Huang, “Small Sample Learning during Multimedia Retrieval using BiasMap,” in Proceedings of IEEE CVPR’01, December 2001, pp. 11–17.
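
The feature computation of Section II, equations (1) to (3), can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the monomial basis g_j(q) = q^j, the rescaling of the quefrency indices to [0, 1], the small floor inside the logarithm, the synthetic decaying sinusoid used as a stand-in event, and all function names are hypothetical choices. The sketch computes the cepstrum of equation (1), the regression coefficients of equation (3), and one step of the recursive inversion of R_(p) through the Schur complement of R_(p-1).

```python
import numpy as np


def cepstrum(event):
    """Equation (1): c = |FT^-1{ log|FT{e}| }|."""
    spectrum = np.abs(np.fft.fft(event))
    # Assumption (not in the paper): a small floor avoids log(0).
    log_spectrum = np.log(np.maximum(spectrum, 1e-12))
    return np.abs(np.fft.ifft(log_spectrum))


def design_matrix(n, p):
    """Q_(p): an N x (p+1) matrix whose column j holds a polynomial of order j.

    Assumption: monomials q**j on quefrency indices rescaled to [0, 1].
    """
    q = np.linspace(0.0, 1.0, n)
    return np.vander(q, p + 1, increasing=True)


def crc(c, p):
    """Equation (3): a_hat = R^-1 Q^T c, with R = Q^T Q."""
    Q = design_matrix(len(c), p)
    R = Q.T @ Q
    return np.linalg.solve(R, Q.T @ c)


def grow_inverse(R_inv_prev, Q_prev, new_col):
    """Inverse of R_(p) built from the inverse of R_(p-1).

    R_(p) = [[R_(p-1), b], [b^T, d]] is inverted blockwise using the
    Schur complement s = d - b^T R_(p-1)^-1 b.
    """
    b = Q_prev.T @ new_col          # off-diagonal block of R_(p)
    d = new_col @ new_col           # new diagonal entry of R_(p)
    u = R_inv_prev @ b
    s = d - b @ u                   # Schur complement (a scalar here)
    top_left = R_inv_prev + np.outer(u, u) / s
    top_right = -u[:, None] / s
    return np.block([[top_left, top_right],
                     [top_right.T, np.array([[1.0 / s]])]])


# Stand-in impulsive event (hypothetical data): a noisy decaying sinusoid.
rng = np.random.default_rng(0)
t = np.arange(256)
event = np.exp(-t / 20.0) * np.sin(0.3 * t) + 0.01 * rng.standard_normal(256)

c = cepstrum(event)
a2 = crc(c, 2)   # order 2: 3-dimension feature vector
a6 = crc(c, 6)   # order 6: 7-dimension feature vector

# Raise the order from 2 to 3 without inverting R_(3) from scratch.
Q2 = design_matrix(len(c), 2)
Q3 = design_matrix(len(c), 3)
R3_inv = grow_inverse(np.linalg.inv(Q2.T @ Q2), Q2, Q3[:, 3])
```

Solving the normal equations with `np.linalg.solve` mirrors equation (3) directly. With a monomial basis the matrix R_(p) becomes poorly conditioned as the order grows, so an orthogonal polynomial family would probably be a safer choice for g_j at higher orders.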