=Paper=
{{Paper
|id=None
|storemode=property
|title=MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling
|pdfUrl=https://ceur-ws.org/Vol-1043/mediaeval2013_submission_98.pdf
|volume=Vol-1043
|dblpUrl=https://dblp.org/rec/conf/mediaeval/SuKCCS13
}}
==MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling==
Han Su1, Fang-Fei Kuo2, Chu-Hsiang Chiu1, Yen-Ju Chou1, Man-Kwan Shan1
Department of Computer Science, National Chengchi University, Taipei, Taiwan1
Department of Electrical Engineering, University of Washington, Seattle, USA2
{101753004,101753026,101971001,mkshan}@nccu.edu.tw1, ffkuo@uw.edu2
ABSTRACT
This paper presents our approaches to soundtrack selection
for commercials based on audio/visual correlation analysis.
Two approaches are adopted. One is based on multimodal
latent semantic analysis (MLSA) and the other is based on
cross-modal factor analysis (CFA). The evaluation based
on the MediaEval Soundtrack Selection for Commercials
Dataset shows the performance of our systems.
Keywords
Soundtrack selection, Multimodal correlation analysis, Multi-type latent semantic analysis, Cross-modal factor analysis
1. MOTIVATION
Automatic soundtrack selection for videos has received more and more attention. The rationale of our approach to automatic soundtrack selection is based on the latent correlation between the video and audio in the training data (Development Dataset). Two methods of multimodal correlation model learning are utilized in our approach. In this paper, we present our soundtrack recommendation using the two methods respectively and evaluate the system on the MediaEval corpus.

2. SYSTEM ARCHITECTURE
Figure 1 shows the architecture of the proposed soundtrack selection based on our previous work [1]. In the training phase, we first transform the descriptors of the audio/visual features provided in the development dataset (devset) into audio/visual words and generate the audio/visual feature matrices. Then two algorithms are employed to learn the content correlation model from the visual/audio feature matrices. For the recommendation dataset (recset), the audio features of each soundtrack are transformed into audio words in the same way as for the development dataset. In the test phase, given a test video, the descriptors of its visual features are transformed into visual words in the same way as those of the devset. The transformed visual words of the test video, along with the audio words of the recset, are fed into the learned content correlation model, and the ranking results for soundtrack selection are generated.

Figure 1: System Architecture of Our Approaches [1].

3. AUDIO WORD EXTRACTION
We use the officially provided audio features, including Beat, Key, MFCC, BLF, and PS09 [4], and transform them into audio words by discretization or vector quantization (VQ). For one-dimensional descriptors, such as those of Beat, equal-frequency binning is employed for discretization. The number of bins is set to 19, the square root of the devset size [7]. For multidimensional descriptors, clustering-based vector quantization is performed to group the descriptors in the feature space into clusters. For the descriptors of BLF, we use the Manhattan distance and apply average-link and complete-link clustering respectively. For the descriptors of PS09 and the FP descriptor of MFCC, we use the Euclidean distance along with K-means. For each of the three descriptors of MFCC, a Gaussian Mixture Model is utilized to model the frame-based representation of an audio track; then K-L divergence along with the Earth Mover's distance is used to measure the distance, followed by average-link and complete-link clustering.
After vector quantization/discretization, each cluster/bin may be regarded as an audio word that represents the descriptors belonging to that cluster/bin. An audio descriptor is encoded as the audio word given by the index of the cluster/bin to which it belongs. An audio word vector contains the presence or absence information of each audio word in the soundtrack, while the audio feature vector of a soundtrack is formed by concatenating the audio word vectors of all descriptor types.
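To make the two word-extraction routes above concrete, the following is a minimal Python sketch of equal-frequency binning for scalar descriptors and clustering-based vector quantization for multidimensional ones, followed by the presence/absence encoding. The bin and cluster counts and the use of scikit-learn's KMeans in place of the agglomerative variants are illustrative assumptions, not the exact setup of the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def equal_frequency_bins(train_values, n_bins=19):
    """Equal-frequency binning: cut points are quantiles of the training
    values, so each bin receives roughly the same number of items."""
    cuts = np.quantile(train_values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return lambda v: np.digitize(v, cuts)        # value -> word (bin) index

def vq_codebook(train_descriptors, n_clusters=20):
    """Clustering-based VQ for a multidimensional descriptor; the
    cluster index of a descriptor is its audio word."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(train_descriptors)
    return km.predict                            # descriptor -> word index

def word_vector(word_ids, vocab_size):
    """Presence/absence vector of the words occurring in one track."""
    v = np.zeros(vocab_size)
    v[np.asarray(word_ids)] = 1.0
    return v

# The audio feature vector of a soundtrack is the concatenation of the
# word vectors of all descriptor types (Beat, Key, MFCC, BLF, PS09).
```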
4. VISUAL WORD EXTRACTION
The officially provided visual features are based on MPEG-7. In MPEG, the determination of frame types (I-, P-, and B-frames) depends on the compression algorithm of the MPEG encoder, and I-frames may not be key-frames. In our work, the visual features are therefore extracted at the shot level, where shot boundary detection is based on calculating the edge change fraction in the temporal domain [8]. We then extract 13 types of visual descriptors, including color energy, saturation proportion, angular second moment, contrast, correlation, dissimilarity, entropy, homogeneity, GLCM mean, GLCM variance, light median, shadow proportion, and visual excitement [1]. Since each of the 13 visual descriptors is a scalar, equal-frequency binning is performed to generate the visual words. Visual word vectors and visual feature vectors are encoded in the same way as audio word vectors and audio feature vectors.
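A sketch of the edge change fraction that drives the shot boundary detection of [8], assuming OpenCV and uint8 grayscale frames; the Canny thresholds, dilation size, and decision threshold are illustrative values, not taken from the paper.

```python
import cv2
import numpy as np

def edge_change_fraction(prev_gray, gray):
    """Fraction of edge pixels entering or exiting between two
    consecutive frames; a spike marks a likely shot boundary [8]."""
    e_prev = cv2.Canny(prev_gray, 100, 200) > 0
    e_curr = cv2.Canny(gray, 100, 200) > 0
    kernel = np.ones((5, 5), np.uint8)           # tolerance for small motion
    d_prev = cv2.dilate(e_prev.astype(np.uint8), kernel) > 0
    d_curr = cv2.dilate(e_curr.astype(np.uint8), kernel) > 0
    entering = (e_curr & ~d_prev).sum() / max(e_curr.sum(), 1)
    exiting = (e_prev & ~d_curr).sum() / max(e_prev.sum(), 1)
    return max(entering, exiting)

# Frames whose fraction exceeds a threshold (e.g. 0.5) start a new shot;
# the 13 scalar descriptors are then computed per shot and binned.
```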
5. CONTENT CORRELATION MODELING & RECOMMENDATION
We investigate two approaches for learning the correlation between audio and visual contents from the devset.

5.1 CFA (Cross-Modal Factor Analysis)
CFA finds the correlation by transforming the audio and visual contents into a common space [2]. Given an audio feature matrix X and a video feature matrix Y, where each row corresponds to the feature vector of a commercial, CFA finds the orthonormal transformation matrices A and B that minimize ||XA - YB||_F^2, where ||M||_F denotes the Frobenius norm of a matrix M. A and B can be obtained by Singular Value Decomposition (SVD) of X^T Y, such that A = U_xy and B = V_xy, where X^T Y = U_xy S_xy V_xy^T. Matrices A and B encode the correlation information. In our work, given a test video f with visual feature vector y_f and a soundtrack m with audio feature vector x_m, the distance d(m, f) between m and f is the Euclidean distance between x_m A and y_f B. The nearest five soundtracks in the recset are recommended for each test video.
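A compact sketch of the CFA step, assuming numpy feature matrices with one row per commercial. Truncating U and V to the first k columns mirrors the "Eigen-number" parameter of Table 1, though the exact truncation scheme is our assumption.

```python
import numpy as np

def cfa_fit(X, Y, k=200):
    """Orthonormal A, B minimizing ||XA - YB||_F^2 come from the SVD
    of X^T Y = U S V^T with A = U, B = V [2]; keep k components."""
    U, S, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, :k], Vt.T[:, :k]

def cfa_recommend(A, B, X_rec, y_f, top_n=5):
    """Rank recset soundtracks by the Euclidean distance between the
    projected audio vectors X_rec A and the projected query y_f B."""
    d = np.linalg.norm(X_rec @ A - y_f @ B, axis=1)
    return np.argsort(d)[:top_n]
```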
5.2 MLSA (Multi-type Latent Semantic Analysis)
The other approach we adopted is MLSA [6], which exploits pairwise co-occurrence correlations among multiple types of entities (descriptors). MLSA represents the entities and their correlations by a unified co-occurrence matrix

C = \begin{pmatrix} 0 & M_{12} & \cdots & M_{1N} \\ M_{21} & 0 & \cdots & M_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ M_{N1} & M_{N2} & \cdots & 0 \end{pmatrix}

C is composed of N × N correlation matrices, where N is the total number of descriptor types and M_ij is the co-occurrence matrix of descriptor types i and j. C can be decomposed by eigendecomposition. The top k eigenvalues λ_1 ≥ λ_2 ≥ ... ≥ λ_k and the corresponding eigenvectors [e_1, e_2, ..., e_k] span a k-dimensional latent space, which can be represented as the matrix C_k = [λ_1·e_1, λ_2·e_2, ..., λ_k·e_k]. Given a test video f with feature vector y_f, we first generate the query vector y_q by concatenating y_f with a zero audio feature vector. To project onto the latent space, y_q is multiplied by C_k. The likelihood of occurrence l(a, f) between an audio descriptor a and the test video f is the cosine similarity between y_q C_k and the row vector of C_k corresponding to the audio descriptor a. The similarity score between a soundtrack m and the test video f is then

r(m, f) = \sum_{a \in m} l(a, f).

The top five soundtracks in the recset are recommended for each test video.
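The MLSA scoring can be sketched as follows, assuming the symmetric co-occurrence matrix C has already been assembled; the helper names and the dense eigendecomposition via numpy are illustrative.

```python
import numpy as np

def mlsa_fit(C, k=400):
    """Top-k eigenpairs of the symmetric matrix C; C_k stores the
    columns lambda_i * e_i spanning the k-dimensional latent space."""
    w, E = np.linalg.eigh(C)
    idx = np.argsort(w)[::-1][:k]                # largest eigenvalues first
    return E[:, idx] * w[idx]

def mlsa_rank(Ck, y_q, soundtrack_words, top_n=5):
    """y_q: visual words of the test video padded with a zero audio
    part.  soundtrack_words[m]: row indices (audio words) occurring in
    soundtrack m.  Scores r(m, f) = sum over a in m of l(a, f)."""
    q = y_q @ Ck                                 # project query onto latent space
    qn = q / (np.linalg.norm(q) + 1e-12)
    rows = Ck / (np.linalg.norm(Ck, axis=1, keepdims=True) + 1e-12)
    l = rows @ qn                                # cosine similarity l(a, f) per word
    scores = [l[words].sum() for words in soundtrack_words]
    return np.argsort(scores)[::-1][:top_n]
```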
6. PERFORMANCE EVALUATION
We perform five-fold cross-validation on the devset to evaluate the performance of our approach and select the best three models for the submitted ranking results. The original soundtrack of each commercial is regarded as the ground truth and is ranked along with the music objects in the recset. The accuracy in our work is defined as 1 - (rank(g) - 1)/(|C| + 1), where rank(g) is the rank of the ground truth and |C| is the number of music objects in the recset. The two CFA models with the highest accuracy and the single best MLSA model are submitted. Table 1 shows the adopted learning algorithms, parameters, accuracy, and the officially rated scores of our three submitted results.

Table 1. Performance and Parameters of Submitted Results.

| Algorithm | CFA | CFA | MLSA |
|---|---|---|---|
| No. Clusters (GMM, MFCC) | 10 | 10 | 10 |
| No. Clusters (KL, MFCC) | 10 | 10 | 10 |
| No. Clusters (FP) | 20 | 20 | 10 |
| No. Clusters (BLF) | 30 | 10 | 20 |
| Eigen-number | 200 | 150 | 400 |
| Accuracy | 0.670 | 0.673 | 0.547 |
| First rank average | 2.292 | 2.289 | 2.272 |
| Top-five average | 2.264 | 2.259 | 2.211 |
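As a sanity check on the accuracy measure above, a worked instance; the example rank and recset size are hypothetical.

```python
def accuracy(rank_g, n_music):
    """1 - (rank(g) - 1) / (|C| + 1): equals 1.0 when the original
    soundtrack is ranked first, tending to 0 as it sinks to the bottom."""
    return 1 - (rank_g - 1) / (n_music + 1)

# e.g. ground truth ranked 3rd among 100 recset tracks:
# accuracy(3, 100) = 1 - 2/101 ≈ 0.980
```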
REFERENCES
[1] F. F. Kuo, M. K. Shan, and S. Y. Lee. Background Music Recommendation for Video Based on Multimodal Latent Semantic Analysis. IEEE Intl. Conf. on Multimedia and Expo, 2013.
[2] D. Li, N. Dimitrova, M. Li, and I. K. Sethi. Multimedia Content Processing through Cross-Modal Association. ACM Intl. Conf. on Multimedia, 2003.
[3] C. C. S. Liem, M. Larson, and A. Hanjalic. When Music Makes a Scene – Characterizing Music in Multimedia Contexts via User Scene Descriptions. Intl. Journal of Multimedia Information Retrieval, Vol. 2, Issue 1, 2013.
[4] T. Pohle, D. Schnitzer, M. Schedl, P. Knees, and G. Widmer. On Rhythm and General Music Similarity. Intl. Symp. on Music Information Retrieval, 2009.
[5] J. Urbano and M. Schedl. Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems. Intl. Journal of Multimedia Information Retrieval, 2013.
[6] X. Wang, J. T. Sun, Z. Chen, and C. X. Zhai. Latent Semantic Analysis for Multiple-Type Interrelated Data Objects. ACM Intl. Conf. on Information Retrieval (SIGIR), 2006.
[7] Y. Yang and G. Webb. Proportional k-Interval Discretization for Naïve-Bayes Classifiers. European Conf. on Machine Learning, 2001.
[8] R. Zabih, J. Miller, and K. Mai. A Feature-based Algorithm for Detecting and Classifying Scene Breaks. ACM Intl. Conf. on Multimedia, 1995.

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain