MediaEval 2013: Soundtrack Selection for Commercials
                Based on Content Correlation Modeling
                 Han Su1, Fang-Fei Kuo2, Chu-Hsiang Chiu1, Yen-Ju Chou1, Man-Kwan Shan1
              Department of Computer Science, National Chengchi University, Taipei, Taiwan1
            Department of Electrical Engineering, University of Washington, Washington, America2
                {101753004,101753026,101971001,mkshan}@nccu.edu.tw1, ffkuo@uw.edu2

ABSTRACT
This paper presents our approaches to soundtrack selection
for commercials based on audio/visual correlation analysis.
Two approaches are adopted: one is based on multimodal
latent semantic analysis (MLSA) and the other on
cross-modal factor analysis (CFA). Both systems are evaluated
on the MediaEval Soundtrack Selection for Commercials
Dataset.
Keywords
Soundtrack selection, Multimodal correlation analysis, Multi-type
latent semantic analysis, Cross-modal factor analysis
1. MOTIVATION
Automatic soundtrack selection for videos has received increasing
attention. Our approach rests on the latent correlation between video
and audio learned from training data (the development dataset). We use
two methods of multimodal correlation model learning. In this paper, we
present soundtrack recommendation with each of the two methods and
evaluate the systems on the MediaEval corpus.

2. SYSTEM ARCHITECTURE
Figure 1 shows the architecture of the proposed soundtrack selection,
which is based on our previous work [1]. In the training phase, we first
transform the descriptors of the audio/visual features provided in the
development dataset (devset) into audio/visual words and generate the
audio/visual feature matrices. Two algorithms are then employed to learn
the content correlation model from the visual/audio feature matrices.
For the recommendation dataset (recset), the audio features of each
soundtrack are transformed into audio words in the same way as for the
devset. In the test phase, given a test video, the descriptors of its
visual features are transformed into visual words in the same way as
those of the devset. The visual words of the test video, along with the
audio words of the recset, are fed into the learned content correlation
model, which generates the ranking results for soundtrack selection.

Figure 1: System Architecture of Our Approaches [1].

3. AUDIO WORD EXTRACTION
We use the officially provided audio features, including Beat, Key,
MFCC, BLF, and PS09 [4], and transform them into audio words by
discretization or vector quantization (VQ). For one-dimensional
descriptors, such as those of Beat, equal frequency binning is employed
for discretization. The number of bins is set to 19, which is the square
root of the size of the devset [7]. For multidimensional descriptors,
clustering-based vector quantization groups the descriptors in the
feature space into clusters. For the BLF descriptors, we measure
distance with the Manhattan distance and cluster with average-link and
complete-link clustering, respectively. For the PS09 descriptors and the
FP descriptor of MFCC, we use the Euclidean distance along with K-means.
For each of the three MFCC descriptors, a Gaussian Mixture Model (GMM)
models the frame-based representation of an audio track. The K-L
divergence along with the Earth Mover's Distance is then used to measure
distance, followed by average-link and complete-link clustering.
After vector quantization/discretization, each cluster/bin may be
regarded as an audio word that represents the descriptors belonging to
that cluster/bin. An audio descriptor is encoded into an audio word
vector by the index of the cluster/bin to which it belongs. An audio
word vector records the presence or absence of each audio word in the
soundtrack, while the audio feature vector of a soundtrack is formed by
concatenating the audio word vectors of all descriptor types.
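
The discretization and encoding steps of Section 3 can be illustrated with a minimal NumPy sketch. It is not the original implementation: the function names (equal_frequency_edges, to_word_index, word_vector) and the synthetic values are assumptions made for illustration, and only the scalar (equal frequency binning) case is covered, not the clustering-based vector quantization.

```python
import numpy as np


def equal_frequency_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points of equal-frequency bins.

    Cut points are placed at evenly spaced quantiles, so each bin
    receives roughly the same number of training values.
    """
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)


def to_word_index(value, edges):
    """Map a scalar descriptor to the index of the bin (word) it falls in."""
    return int(np.searchsorted(edges, value, side="right"))


def word_vector(indices, n_words):
    """Binary presence/absence vector over n_words words."""
    vec = np.zeros(n_words, dtype=int)
    vec[list(indices)] = 1
    return vec


if __name__ == "__main__":
    # Synthetic stand-in for one scalar descriptor over a training set.
    rng = np.random.default_rng(0)
    train_values = rng.normal(size=400)

    # 19 bins, as in the text; the cut points are learned on training data.
    edges = equal_frequency_edges(train_values, n_bins=19)
    word = to_word_index(0.3, edges)

    # Feature vector = concatenation of the word vectors of all descriptor
    # types (here two hypothetical types with 19 and 10 words).
    feature = np.concatenate([word_vector([word], 19), word_vector([4], 10)])
    print(word, feature.shape)
```

The cut points are computed once on the training values and reused for new items, which mirrors the statement that recset soundtracks are transformed in the same way as the devset.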
 Copyright is held by the author/owner(s).
 MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain
4. VISUAL WORD EXTRACTION
The officially provided visual features are based on MPEG-7. In MPEG,
the determination of frame types (I-, P-, and B-frames) depends on the
compression algorithm of the MPEG encoder. Because I-frames are not
necessarily key-frames, in our work the visual features are extracted at
the shot level, where shot boundary detection is based on calculating
the edge change fraction in the temporal domain [8]. We then extract 13
types of visual descriptors: color energy, saturation proportion,
angular second moment, contrast, correlation, dissimilarity, entropy,
homogeneity, GLCM mean, GLCM variance, light median, shadow proportion,
and visual excitement [1]. Since each of the 13 visual descriptors is
scalar, equal frequency binning is used to generate the visual words.
Visual word vectors and visual feature vectors are encoded in the same
way as audio word vectors and audio feature vectors.

5. CONTENT CORRELATION MODELING & RECOMMENDATION
We investigate two approaches for learning the correlation between audio
and visual contents from the devset.

5.1 CFA (Cross-Modal Factor Analysis)
CFA tries to find the correlation by transforming the audio and visual
contents into a common space [2]. Given an audio feature matrix $X$ and
a video feature matrix $Y$, where each row corresponds to the feature
vector of a commercial, CFA finds the orthonormal transformation
matrices $A$ and $B$ that minimize $\|XA - YB\|_F^2$, where $\|M\|_F$ is
the Frobenius norm of a matrix $M$. Matrices $A$ and $B$ can be obtained
by Singular Value Decomposition (SVD) of $X^T Y$, such that
$A = U_{xy}$ and $B = V_{xy}$, where $X^T Y = U_{xy} S_{xy} V_{xy}^T$.
Matrices $A$ and $B$ encode the correlation information. In our work,
given a test video $f$ with visual feature vector $y_f$ and a soundtrack
$m$ with audio feature vector $x_m$, the distance $d(m, f)$ between $m$
and $f$ is the Euclidean distance between $x_m A$ and $y_f B$. The
nearest five soundtracks in the recset are recommended for each test
video.

5.2 MLSA (Multi-type Latent Semantic Analysis)
The other approach we adopted is MLSA, which exploits pairwise
co-occurrence correlations among multiple types of entities
(descriptors). MLSA represents the entities and correlations by a
unified co-occurrence matrix

\[
C = \begin{pmatrix}
0      & M_{12} & \cdots & M_{1N} \\
M_{21} & 0      & \cdots & M_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
M_{N1} & M_{N2} & \cdots & 0
\end{pmatrix}
\]

$C$ is composed of $N \times N$ correlation matrices, where $N$ is the
total number of descriptor types, and $M_{ij}$ is the co-occurrence
matrix of descriptor types $i$ and $j$. $C$ can be decomposed by
eigendecomposition. The top $k$ eigenvalues
$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ and the corresponding
eigenvectors $[e_1, e_2, \ldots, e_k]$ span a $k$-dimensional latent
space, which can be represented as a matrix
$C_k = [\lambda_1 e_1, \lambda_2 e_2, \ldots, \lambda_k e_k]$. Given a
test video $f$ with feature vector $y_f$, we first generate the query
vector $y_q$ by concatenating $y_f$ with an all-zero audio feature
vector. To project onto the latent space, $y_q$ is multiplied by $C_k$.
The likelihood of occurrence $l(a, f)$ between an audio descriptor $a$
and the test video $f$ is the cosine similarity between $y_q C_k$ and
the row vector of $C_k$ corresponding to the audio descriptor $a$. The
similarity score between a soundtrack $m$ and the test video $f$ is then

\[
r(m, f) = \sum_{\forall a \in m} l(a, f).
\]

The top five soundtracks in the recset are recommended for each test
video.

6. PERFORMANCE EVALUATION
We use five-fold cross-validation on the devset to evaluate the
performance of our approach and select the best three models to obtain
the ranking results. The original soundtrack of the commercial is
regarded as the ground truth and is ranked along with the music objects
in the recset. The accuracy in our work is defined as
$1 - (\mathrm{rank}(g) - 1)/(|C| + 1)$, where $\mathrm{rank}(g)$ is the
rank of the ground truth and $|C|$ is the number of music objects in the
recset. The two most accurate CFA models and the most accurate MLSA
model are submitted. Table 1 shows the adopted learning algorithms,
parameters, accuracy, and officially rated scores of our three submitted
results.

Table 1. Performance and Parameters of Submitted Results.
Algorithm                    CFA      CFA      MLSA
No. clusters (GMM, MFCC)     10       10       10
No. clusters (KL, MFCC)      10       10       10
No. clusters (FP)            20       20       10
No. clusters (BLF)           30       10       20
Eigen-number                 200      150      400
Accuracy                     0.670    0.673    0.547
First rank average           2.292    2.289    2.272
Top-five average             2.264    2.259    2.211

REFERENCES
[1] F. F. Kuo, M. K. Shan, and S. Y. Lee, Background Music
    Recommendation for Video Based on Multimodal Latent Semantic
    Analysis, IEEE Intl. Conf. on Multimedia and Expo, 2013.
[2] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, Multimedia Content
    Processing through Cross-Modal Association, ACM Intl. Conf. on
    Multimedia, 2003.
[3] C. C. S. Liem, M. Larson, and A. Hanjalic, When Music Makes a
    Scene – Characterizing Music in Multimedia Contexts Via User Scene
    Descriptions, Intl. Journal of Multimedia Information Retrieval,
    Vol. 2, Issue 1, 2013.
[4] T. Pohle, D. Schmitzer, M. Schedl, P. Knees, and G. Widmer, On
    Rhythm and General Music Similarity, Intl. Symp. for Music
    Information Retrieval, 2009.
[5] J. Urbano and M. Schedl, Minimal Test Collections for Low-Cost
    Evaluation of Audio Music Similarity and Retrieval Systems, Intl.
    Journal of Multimedia Information Retrieval, 2013.
[6] X. Wang, J. T. Sun, Z. Chen, and C. X. Zhai, Latent Semantic
    Analysis for Multiple-Type Interrelated Data Objects, ACM Intl.
    Conf. on Information Retrieval, 2013.
[7] Y. Yang and G. Webb, Proportional k-Interval Discretization for
    Naïve-Bayes Classifiers, European Conf. on Machine Learning, 2001.
[8] R. Zabih, J. Miller, and K. Mai, A Feature-based Algorithm for
    Detecting and Classifying Scene Breaks, ACM Intl. Conf. on
    Multimedia, 1995.
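
As an illustration of the CFA procedure in Section 5.1 and the accuracy measure in Section 6, the following NumPy sketch may help. It is a sketch under stated assumptions rather than the system actually submitted: the function names (fit_cfa, rank_soundtracks, accuracy) are hypothetical, and the parameter k assumes that the "Eigen-number" of Table 1 denotes the number of leading singular-vector pairs that are kept.

```python
import numpy as np


def fit_cfa(X, Y, k):
    """Learn CFA transformation matrices from paired training data.

    X: (n, p) audio feature matrix, Y: (n, q) visual feature matrix;
    row i of X and row i of Y belong to the same commercial.
    Returns A (p, k) and B (q, k), the leading left and right singular
    vectors of X^T Y. Keeping only k columns is an assumption.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, :k], Vt.T[:, :k]


def rank_soundtracks(y_f, X_rec, A, B, top_n=5):
    """Rank candidate soundtracks for one test video.

    y_f: (q,) visual feature vector of the test video.
    X_rec: (m, p) audio feature vectors of the candidate soundtracks.
    Returns the indices of the top_n nearest candidates and all distances.
    """
    query = y_f @ B                      # video projected to common space
    candidates = X_rec @ A               # soundtracks in the same space
    dists = np.linalg.norm(candidates - query, axis=1)
    return np.argsort(dists)[:top_n], dists


def accuracy(rank_g, n_recset):
    """Accuracy of Section 6: 1 - (rank(g) - 1) / (|C| + 1)."""
    return 1.0 - (rank_g - 1) / (n_recset + 1)


if __name__ == "__main__":
    # Synthetic data only; shapes and values are not from the paper.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 12))
    Y = rng.normal(size=(50, 8))
    A, B = fit_cfa(X, Y, k=4)
    top5, _ = rank_soundtracks(Y[0], X, A, B)
    print(top5, accuracy(rank_g=3, n_recset=50))
```

Ranking by Euclidean distance in the projected space follows the description of d(m, f) in Section 5.1; the nearest five indices correspond to the five recommended soundtracks.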