MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling

Han Su1, Fang-Fei Kuo2, Chu-Hsiang Chiu1, Yen-Ju Chou1, Man-Kwan Shan1
1 Department of Computer Science, National Chengchi University, Taipei, Taiwan
2 Department of Electrical Engineering, University of Washington, Washington, USA
{101753004,101753026,101971001,mkshan}@nccu.edu.tw, ffkuo@uw.edu

ABSTRACT
This paper presents our approaches to soundtrack selection for commercials based on audio/visual correlation analysis. Two approaches are adopted. One is based on multimodal latent semantic analysis (MLSA) and the other is based on cross-modal factor analysis (CFA). Evaluation on the MediaEval Soundtrack Selection for Commercials Dataset shows the performance of our systems.

Keywords
Soundtrack selection, Multimodal correlation analysis, Multi-type latent semantic analysis, Cross-modal factor analysis

Copyright is held by the author/owner(s). MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

1. MOTIVATION
Automatic soundtrack selection for videos has received increasing attention. Our approach rests on the latent correlation between video and audio, learned from the training data (the development dataset). Two methods of multimodal correlation model learning are used. In this paper, we present soundtrack recommendation with each of the two methods and evaluate the systems on the MediaEval corpus.

2. SYSTEM ARCHITECTURE
Figure 1 shows the architecture of the proposed soundtrack selection, which is based on our previous work [1]. In the training phase, we first transform the descriptors of the audio/visual features provided in the development dataset (devset) into audio/visual words and generate the audio/visual feature matrices. Two algorithms are then employed to learn the content correlation model from the visual/audio feature matrices. For the recommendation dataset (recset), the audio features of each soundtrack are transformed into audio words in the same way as for the devset. In the test phase, given a test video, the descriptors of its visual features are transformed into visual words in the same way as those of the devset. The visual words of the test video, along with the audio words of the recset, are fed into the learned content correlation model, which generates the ranking results for soundtrack selection.

Figure 1: System Architecture of Our Approaches [1].

3. AUDIO WORD EXTRACTION
We use the officially provided audio features, including Beat, Key, MFCC, BLF, and PS09 [4], and transform them into audio words by discretization or vector quantization (VQ). For one-dimensional descriptors, such as the descriptors of Beat, equal frequency binning is employed for discretization. The number of bins is set to 19, which is the square root of the number of items in the devset [7]. For multidimensional descriptors, clustering-based vector quantization is performed to group descriptors in the feature space into clusters. For the descriptors of BLF, we use the Manhattan distance and apply average-link and complete-link clustering respectively. For the descriptors of PS09 and the FP descriptor of MFCC, we use the Euclidean distance along with K-means. For each of the three descriptors of MFCC, a Gaussian Mixture Model is used to model the frame-based representation of an audio recording. The K-L divergence along with the Earth Mover's Distance is then used to measure the distance, followed by average-link and complete-link clustering.

After vector quantization/discretization, each cluster/bin may be regarded as an audio word that represents the descriptors belonging to that cluster/bin. An audio descriptor is encoded into an audio word vector by the index of the cluster/bin to which it belongs. An audio word vector contains the presence or absence information of each audio word in the soundtrack, while the audio feature vector of a soundtrack is formed by concatenating the audio word vectors of all descriptor types.

4. VISUAL WORD EXTRACTION
The officially provided visual features are based on MPEG-7. In MPEG, the determination of frame types (I-, P-, and B-frames) depends on the compression algorithm of the MPEG encoder, so I-frames may not be key-frames. In our work, the visual features are therefore extracted at the shot level, where shot boundary detection is based on calculating the edge change fraction in the temporal domain [8]. We then extract 13 types of visual descriptors: color energy, saturation proportion, angular second moment, contrast, correlation, dissimilarity, entropy, homogeneity, GLCM mean, GLCM variance, light median, shadow proportion, and visual excitement [1]. Since each of the 13 visual descriptors is scalar, equal frequency binning is performed to generate visual words. Visual word vectors and visual feature vectors are encoded in the same way as audio word vectors and audio feature vectors.

5. CONTENT CORRELATION MODELING & RECOMMENDATION
We investigate two approaches for learning the correlation between audio and visual contents from the devset.

5.1 CFA (Cross-Modal Factor Analysis)
CFA tries to find the correlation by transforming the audio and visual contents into a common space [2]. Given an audio feature matrix $X$ and a video feature matrix $Y$, where each row corresponds to the feature vector of a commercial, CFA finds the orthonormal transformation matrices $A$ and $B$ that minimize $\|XA - YB\|_F^2$, where $\|M\|_F$ is the Frobenius norm of matrix $M$. Matrices $A$ and $B$ can be obtained by Singular Value Decomposition (SVD) of $X^T Y$ such that $A = U_{xy}$, $B = V_{xy}$, where $X^T Y = U_{xy} S_{xy} V_{xy}^T$. Matrices $A$ and $B$ encode the correlation information. In our work, given a test video $f$ with visual feature vector $y_f$ and a soundtrack $m$ with audio feature vector $x_m$, the distance $d(m, f)$ between $m$ and $f$ is the Euclidean distance between $x_m A$ and $y_f B$. The nearest five soundtracks in the recset are recommended for each test video.

5.2 MLSA (Multi-type Latent Semantic Analysis)
The other approach we adopted is MLSA, which exploits pairwise co-occurrence correlations among multiple types of entities (descriptors). MLSA represents the entities and correlations by a unified co-occurrence matrix

$$
C = \begin{pmatrix}
0 & M_{12} & \cdots & M_{1N} \\
M_{21} & 0 & \cdots & M_{2N} \\
\vdots & \vdots & \ddots & \vdots \\
M_{N1} & M_{N2} & \cdots & 0
\end{pmatrix}
$$

$C$ is composed of $N \times N$ correlation matrices, where $N$ is the total number of descriptor types, and $M_{ij}$ is the co-occurrence matrix of descriptor types $i$ and $j$. $C$ can be decomposed by eigendecomposition. The top $k$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ and the corresponding eigenvectors $[e_1, e_2, \ldots, e_k]$ span a $k$-dimensional latent space, which can be represented as a matrix $C_k = [\lambda_1 e_1, \lambda_2 e_2, \ldots, \lambda_k e_k]$. Given a test video $f$ with feature vector $y_f$, we first generate the query vector $y_q$ by concatenating $y_f$ with a zero audio feature vector. To project onto the latent space, $y_q$ is multiplied by $C_k$. The likelihood of occurrence $l(a, f)$ between an audio descriptor $a$ and the test video $f$ is the cosine similarity between $y_q C_k$ and the row vector of $C_k$ corresponding to the audio descriptor $a$. The similarity score between a soundtrack $m$ and the test video $f$ is then

$$
r(m, f) = \sum_{\forall a \in m} l(a, f).
$$

The top five soundtracks in the recset are recommended for each test video.

6. PERFORMANCE EVALUATION
We use five-fold cross-validation on the devset to evaluate the performance of our approach and select the best three models to obtain the ranking results. The original soundtrack of the commercial is regarded as the ground truth and is ranked along with the music objects in the recset. The accuracy in our work is defined as $1 - (\mathrm{rank}(g) - 1)/(|C| + 1)$, where $\mathrm{rank}(g)$ is the rank of the ground truth and $|C|$ is the number of music objects in the recset. We submitted the two CFA models and the one MLSA model with the highest accuracy. Table 1 shows the adopted learning algorithms, parameters, accuracy, and officially rated scores of our three submitted results.

Table 1. Performance and Parameters of Submitted Results.
Algorithm                   CFA      CFA      MLSA
No. Clusters (GMM, MFCC)    10       10       10
No. Clusters (KL, MFCC)     10       10       10
No. Clusters (FP)           20       20       10
No. Clusters (BLF)          30       10       20
Eigen-number                200      150      400
Accuracy                    0.670    0.673    0.547
First rank average          2.292    2.289    2.272
Top-five average            2.264    2.259    2.211

REFERENCES
[1] F. F. Kuo, M. K. Shan, and S. Y. Lee, Background Music Recommendation for Video Based on Multimodal Latent Semantic Analysis, IEEE Intl. Conf. on Multimedia and Expo, 2013.
[2] D. Li, N. Dimitrova, M. Li, and I. K. Sethi, Multimedia Content Processing through Cross-Modal Association, ACM Intl. Conf. on Multimedia, 2003.
[3] C. C. S. Liem, M. Larson, and A. Hanjalic, When Music Makes a Scene – Characterizing Music in Multimedia Contexts Via User Scene Descriptions, Intl. Journal of Multimedia Information Retrieval, Vol. 2, Issue 1, 2013.
[4] T. Pohle, D. Schmitzer, M. Schedl, P. Knees, and G. Widmer, On Rhythm and General Music Similarity, Intl. Symp. for Music Information Retrieval, 2009.
[5] J. Urbano and M. Schedl, Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems, Intl. Journal of Multimedia Information Retrieval, 2013.
[6] X. Wang, J. T. Sun, Z. Chen, and C. X. Zhai, Latent Semantic Analysis for Multiple-Type Interrelated Data Objects, ACM Intl. Conf. on Information Retrieval, 2013.
[7] Y. Yang and G. Webb, Proportional k-Interval Discretization for Naïve-Bayes Classifiers, European Conf. on Machine Learning, 2001.
[8] R. Zabih, J. Miller, and K. Mai, A Feature-based Algorithm for Detecting and Classifying Scene Breaks, ACM Intl. Conf. on Multimedia, 1995.
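
The equal frequency binning used for scalar descriptors in Sections 3 and 4 could be sketched roughly as follows. This is only an illustration under assumptions: the function names (fit_equal_frequency_edges, to_word_index, to_word_vector) are hypothetical, cut points are taken from the sorted training values by rank, and tied values may leave some bins uneven or empty.

```python
from bisect import bisect_right


def fit_equal_frequency_edges(values, n_bins=19):
    """Return n_bins - 1 cut points so each bin holds roughly the same
    number of training values (hypothetical helper, rank-based cuts)."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th cut point is the value whose rank is i/n_bins through the data.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def to_word_index(value, edges):
    """Map a scalar descriptor to the index of its bin, i.e. its word."""
    return bisect_right(edges, value)


def to_word_vector(word_indices, n_bins=19):
    """Presence/absence vector over the words of one descriptor type."""
    vec = [0] * n_bins
    for idx in word_indices:
        vec[idx] = 1
    return vec


if __name__ == "__main__":
    # Synthetic stand-in for one scalar descriptor over a training set.
    training = [0.5 * i for i in range(190)]
    edges = fit_equal_frequency_edges(training, n_bins=19)
    words = [to_word_index(v, edges) for v in (1.0, 40.0, 94.0)]
    print(words, to_word_vector(words, n_bins=19))
```

The feature vector of a soundtrack or video would then be the concatenation of such word vectors over all descriptor types, as the text describes.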
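
The CFA procedure of Section 5.1 can be illustrated with a minimal NumPy sketch. It is not the submitted system: the function names fit_cfa and rank_soundtracks are hypothetical, and the truncation parameter k is an assumption about how the Eigen-number in Table 1 might be applied.

```python
import numpy as np


def fit_cfa(X, Y, k=None):
    """Learn orthonormal A, B from paired rows of X (audio) and Y (visual).

    A and B are the left and right singular vectors of X^T Y.
    Keeping only the first k columns is an assumption of this sketch.
    """
    U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    k = len(s) if k is None else k
    return U[:, :k], Vt[:k].T


def rank_soundtracks(y_f, X_rec, A, B, top_n=5):
    """Rank recset soundtracks by Euclidean distance in the common space."""
    video_point = y_f @ B          # test video projected with B
    audio_points = X_rec @ A       # every candidate soundtrack projected with A
    dists = np.linalg.norm(audio_points - video_point, axis=1)
    return np.argsort(dists)[:top_n], dists


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic binary word-presence matrices standing in for devset features.
    X = (rng.random((40, 12)) > 0.5).astype(float)   # audio feature vectors
    Y = (rng.random((40, 8)) > 0.5).astype(float)    # visual feature vectors
    A, B = fit_cfa(X, Y, k=4)
    top, dists = rank_soundtracks(Y[0], X, A, B)
    print("nearest five soundtracks:", top)
```

Smaller distances mean a closer match, so the first five indices returned correspond to the five recommended soundtracks.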
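
The MLSA scoring of Section 5.2 might look like the sketch below. Several details are assumptions rather than the authors' implementation: co-occurrence blocks are computed as products of binary word-presence matrices, the visual words are ordered before the audio words in C, eigenvalues are ranked by signed value, and all function names are hypothetical.

```python
import numpy as np


def build_cooccurrence(word_mats):
    """Unified matrix C from per-type (n_items, n_words) presence matrices.

    Off-diagonal blocks M_ij = W_i^T W_j; diagonal blocks are set to zero.
    """
    W = np.hstack(word_mats).astype(float)
    C = W.T @ W
    start = 0
    for M in word_mats:
        stop = start + M.shape[1]
        C[start:stop, start:stop] = 0.0
        start = stop
    return C


def latent_space(C, k):
    """C_k = [lambda_1 e_1, ..., lambda_k e_k] for the k largest eigenvalues."""
    vals, vecs = np.linalg.eigh(C)          # C is symmetric
    top = np.argsort(vals)[::-1][:k]
    return vecs[:, top] * vals[top]


def score_soundtracks(y_f, audio_rec, Ck):
    """r(m, f): sum of cosine likelihoods l(a, f) over audio words a in m."""
    n_visual, n_audio = len(y_f), audio_rec.shape[1]
    y_q = np.concatenate([y_f, np.zeros(n_audio)])   # zero audio part
    q = y_q @ Ck                                     # query in latent space
    rows = Ck[n_visual:]                             # rows of the audio words
    denom = np.linalg.norm(rows, axis=1) * np.linalg.norm(q)
    l = np.divide(rows @ q, denom, out=np.zeros(n_audio), where=denom > 0)
    return audio_rec @ l


if __name__ == "__main__":
    # Toy devset: visual word i always co-occurs with audio word i.
    visual = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
    audio = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
    Ck = latent_space(build_cooccurrence([visual, audio]), k=2)
    recset = np.array([[1, 0], [0, 1]])              # two candidate soundtracks
    scores = score_soundtracks(np.array([1.0, 0.0]), recset, Ck)
    print("ranking:", np.argsort(scores)[::-1])
```

Higher scores indicate a better match, so recommending the top five soundtracks amounts to taking the five largest entries of the returned score vector.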