MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling

Han Su, Department of Electrical Engineering, University of Washington

Washington America

Fang-Fei Kuo ffkuo@uw.edu

{101753004, 101753026, 101971001

Chu-Hsiang Chiu, Department of Electrical Engineering, University of Washington

Washington America

Yen-Ju Chou, Department of Electrical Engineering, University of Washington

Washington America

Man-Kwan Shan mkshan@nccu.edu, Department of Electrical Engineering, University of Washington

Washington America

Department of Computer Science National Chengchi University

Taipei Taiwan

Keywords: soundtrack selection, multimodal correlation analysis, multi-type latent semantic analysis, cross-modal factor analysis

This paper presents our approaches to soundtrack selection for commercials based on audio/visual correlation analysis. Two approaches are adopted: one is based on multi-type latent semantic analysis (MLSA) and the other on cross-modal factor analysis (CFA). We evaluate both systems on the MediaEval Soundtrack Selection for Commercials Dataset and report their performance.

MOTIVATION

Automatic soundtrack selection for videos has received increasing attention. Our approach rests on the latent correlation between video and audio, which we learn from the training data (the Development Dataset). We use two methods for learning a multimodal correlation model. In this paper, we present soundtrack recommendation with each of the two methods and evaluate the resulting systems on the MediaEval corpus.

SYSTEM ARCHITECTURE

Figure 1 shows the architecture of the proposed soundtrack selection system, which is based on our previous work [1]. In the training phase, we first transform the audio/visual feature descriptors provided in the development dataset (devset) into audio/visual words and generate the audio/visual feature matrices. Two algorithms are then employed to learn the content correlation model from these feature matrices. For the recommendation dataset (recset), the audio features of each soundtrack are transformed into audio words in the same way as for the devset. In the test phase, given a test video, the visual feature descriptors are transformed into visual words in the same way as those of the devset. The visual words of the test video, along with the audio words of the recset, are fed into the learned content correlation model, which generates the ranking results for soundtrack selection.

AUDIO WORD EXTRACTION

We use the officially provided audio features, including Beat, Key, MFCC, BLF, and PS09 [4], and transform them into audio words by discretization or vector quantization (VQ). For one-dimensional descriptors, such as the Beat descriptors, equal frequency binning is employed for discretization. The number of bins is set to 19, which is the square root of the number of items in the devset [7]. For multidimensional descriptors, clustering-based vector quantization groups the descriptors in the feature space into clusters. For the BLF descriptors, we use the Manhattan distance together with average-link and complete-link clustering, respectively. For the PS09 descriptors and the FP descriptor of MFCC, we use the Euclidean distance with K-means. For each of the three MFCC descriptors, a Gaussian Mixture Model is used to model the frame-based representation of an audio track. K-L divergence along with the Earth Mover's Distance is then used to measure the distance, followed by average-link and complete-link clustering.

After vector quantization/discretization, each cluster/bin is regarded as an audio word that represents the descriptors belonging to that cluster/bin. An audio descriptor is encoded into an audio word vector by the index of the cluster/bin to which it belongs. An audio word vector records the presence or absence of each audio word in the soundtrack, and the audio feature vector of a soundtrack is formed by concatenating the audio word vectors of all descriptor types.
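As an illustration of the discretization step for a one-dimensional descriptor, the following is a minimal sketch of equal frequency binning with NumPy. The tempo values, the choice of four bins, and the function names are assumptions made for this example; they are not taken from the dataset or from our implementation.

```python
import numpy as np

def equal_frequency_edges(values, n_bins):
    """Return the n_bins - 1 interior cut points that split `values`
    into bins holding roughly the same number of samples."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def to_word_index(value, edges):
    """Map a scalar descriptor to the index of its bin (its audio word)."""
    return int(np.searchsorted(edges, value, side="right"))

def to_word_vector(word_index, n_bins):
    """Binary vector with a 1 at the position of the observed audio word."""
    vec = np.zeros(n_bins, dtype=int)
    vec[word_index] = 1
    return vec

# Hypothetical tempo values, one per devset soundtrack (illustrative only).
tempos = np.array([60, 72, 80, 90, 96, 100, 110, 120, 128, 140, 150, 170],
                  dtype=float)
n_bins = 4  # the paper uses 19 bins; 4 keeps the example readable

edges = equal_frequency_edges(tempos, n_bins)
words = [to_word_index(t, edges) for t in tempos]
counts = np.bincount(words, minlength=n_bins)  # samples per bin
```

With these twelve values every bin receives three samples, which is the property that distinguishes equal frequency binning from equal width binning.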

VISUAL WORD EXTRACTION

The officially provided visual features are based on MPEG-7. In MPEG, the determination of frame types (I-, P-, and B-frames) depends on the compression algorithm of the MPEG encoder. Since I-frames may not be key-frames, we extract the visual features at the shot level, where shot boundary detection is based on calculating the edge change fraction in the temporal domain [8]. We then extract 13 types of visual descriptors: color energy, saturation proportion, angular second moment, contrast, correlation, dissimilarity, entropy, homogeneity, GLCM mean, GLCM variance, light median, shadow proportion, and visual excitement [1]. Since each of the 13 visual descriptors is a scalar, equal frequency binning is used to generate the visual words. Visual word vectors and visual feature vectors are encoded in the same way as audio word vectors and audio feature vectors.
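The encoding of shot-level scalars into a visual feature vector can be sketched as follows. This is only an illustration: the bin edges, the per-shot values, and the rule that a word is marked present when any shot of the video falls into its bin are assumptions for the example.

```python
import numpy as np

def presence_vector(shot_values, edges):
    """Binary vector marking which bins (visual words) occur in any shot."""
    vec = np.zeros(len(edges) + 1, dtype=int)
    idx = np.searchsorted(edges, shot_values, side="right")
    vec[idx] = 1
    return vec

# Hypothetical bin edges learned on the devset for two of the descriptors.
edges = {
    "contrast": np.array([0.2, 0.5, 0.8]),  # 4 visual words
    "entropy": np.array([1.0, 2.0]),        # 3 visual words
}
# Hypothetical per-shot descriptor values for one video with three shots.
shots = {
    "contrast": np.array([0.1, 0.55, 0.6]),
    "entropy": np.array([2.5, 0.4, 2.1]),
}

# The visual feature vector concatenates the word vectors of all descriptors.
feature = np.concatenate(
    [presence_vector(shots[name], edges[name]) for name in edges]
)
```

Here `feature` has seven entries, four for contrast followed by three for entropy.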

CONTENT CORRELATION MODELING & RECOMMENDATION

We investigate two approaches for learning the correlation between audio and visual contents from the devset.

CFA (Cross-Modal Factor Analysis)

CFA finds the correlation by transforming the audio and visual contents into a common space [2]. Given an audio feature matrix $X$ and a visual feature matrix $Y$, where each row corresponds to the feature vector of a commercial, CFA finds the orthonormal transformation matrices $A$ and $B$ that minimize $\|XA - YB\|_F^2$, where $\|M\|_F$ is the Frobenius norm of a matrix $M$. Matrices $A$ and $B$ can be obtained by Singular Value Decomposition (SVD) of $X^T Y$: if $X^T Y = U_{xy} S_{xy} V_{xy}^T$, then $A = U_{xy}$ and $B = V_{xy}$. Matrices $A$ and $B$ encode the correlation information. In our work, given a test video $f$ with visual feature vector $y_f$ and a soundtrack $m$ with audio feature vector $x_m$, the distance $d(m, f)$ between $m$ and $f$ is the Euclidean distance between $x_m A$ and $y_f B$. The nearest five soundtracks in recset are recommended for each test video.
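The CFA procedure can be sketched with NumPy as below. This is a minimal sketch, not our implementation: the random binary matrices stand in for real feature matrices, and keeping only the first `k` singular vectors (presumably what the eigen-number parameter controls) is an assumption of the example.

```python
import numpy as np

def fit_cfa(X, Y, k):
    """Learn orthonormal transforms from paired audio (X) and visual (Y)
    feature matrices; rows of X and Y describe the same commercials."""
    U, _, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    A = U[:, :k]    # audio transform, shape (audio_dim, k)
    B = Vt[:k].T    # visual transform, shape (visual_dim, k)
    return A, B

def rank_soundtracks(y_f, X_rec, A, B, top_n=5):
    """Indices of the top_n recset soundtracks closest to the test video
    in the common space, ordered by increasing Euclidean distance."""
    distances = np.linalg.norm(X_rec @ A - y_f @ B, axis=1)
    return np.argsort(distances)[:top_n]

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
X_dev = rng.integers(0, 2, size=(30, 6)).astype(float)   # audio, devset
Y_dev = rng.integers(0, 2, size=(30, 4)).astype(float)   # visual, devset
X_rec = rng.integers(0, 2, size=(10, 6)).astype(float)   # audio, recset
y_test = rng.integers(0, 2, size=4).astype(float)        # one test video

A, B = fit_cfa(X_dev, Y_dev, k=3)
top5 = rank_soundtracks(y_test, X_rec, A, B)
```

Because `A` and `B` are taken from the singular vectors, their columns are orthonormal, which matches the constraint stated above.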

MLSA (Multi-type Latent Semantic Analysis)

The other approach we adopted is MLSA that exploits pairwise co-occurrence correlations among multiple types of entities (descriptors). MLSA represents the entities and correlations by a unified co-occurrence matrix

$$C = \begin{bmatrix} 0 & M_{12} & \cdots & M_{1N} \\ M_{21} & 0 & \cdots & M_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ M_{N1} & M_{N2} & \cdots & 0 \end{bmatrix}$$

$C$ is composed of $N \times N$ correlation matrices, where $N$ is the total number of descriptor types and $M_{ij}$ is the co-occurrence matrix of descriptor types $i$ and $j$. $C$ can be decomposed by eigendecomposition. The top $k$ eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_k$ and the corresponding eigenvectors $[e_1, e_2, \ldots, e_k]$ span a $k$-dimensional latent space, which can be represented as a matrix

$$C_k = [\lambda_1 e_1, \lambda_2 e_2, \ldots, \lambda_k e_k].$$

Given a test video $f$ with feature vector $y_f$, we first generate the query vector $y_q$ by concatenating $y_f$ with a zero audio feature vector. To project it onto the latent space, $y_q$ is multiplied by $C_k$. The likelihood of occurrence $l(a, f)$ between an audio descriptor $a$ and the test video $f$ is the cosine similarity between $y_q C_k$ and the row vector of $C_k$ corresponding to the audio descriptor $a$. The similarity score between a soundtrack $m$ and the test video $f$ is then

$$r(m, f) = \sum_{a \in m} l(a, f).$$

The top five soundtracks in recset are recommended for each test video.
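The MLSA scoring steps can be sketched as follows. The sketch makes several assumptions that the text does not fix: co-occurrence is counted as the product of the binary feature matrices, visual words precede audio words in the joint vector, and the soundtrack score is the sum of the likelihoods of its audio words. The tiny dataset is invented for illustration.

```python
import numpy as np

def build_cooccurrence(F, type_sizes):
    """Unified co-occurrence matrix C from a joint binary feature matrix F
    (rows: commercials, columns: words of all descriptor types).
    Blocks that relate a descriptor type to itself are set to zero."""
    C = (F.T @ F).astype(float)
    start = 0
    for size in type_sizes:
        C[start:start + size, start:start + size] = 0.0
        start += size
    return C

def latent_space(C, k):
    """C_k = [lambda_1 e_1, ..., lambda_k e_k] for the k largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order] * eigvals[order]

def cosine_to_rows(u, V):
    """Cosine similarity between vector u and every row of V."""
    denom = np.linalg.norm(u) * np.linalg.norm(V, axis=1)
    return (V @ u) / np.where(denom == 0, 1.0, denom)

def score_soundtracks(y_f, X_rec, C_k, n_visual):
    """r(m, f) for every recset soundtrack m, given visual vector y_f."""
    n_audio = C_k.shape[0] - n_visual
    y_q = np.concatenate([y_f, np.zeros(n_audio)])    # zero audio part
    projected = y_q @ C_k
    likelihood = cosine_to_rows(projected, C_k[n_visual:])  # l(a, f)
    return X_rec @ likelihood                         # sum over words in m

# Toy devset: visual word v0 always co-occurs with audio word a0, v1 with a1.
F = np.array([[1, 0, 1, 0]] * 3 + [[0, 1, 0, 1]] * 3)
C = build_cooccurrence(F, type_sizes=[2, 2])
C_k = latent_space(C, k=2)

X_rec = np.array([[1.0, 0.0], [0.0, 1.0]])  # soundtrack 0 has a0, 1 has a1
scores = score_soundtracks(np.array([1.0, 0.0]), X_rec, C_k, n_visual=2)
best = int(np.argmax(scores))
```

For a test video containing only `v0`, the soundtrack containing `a0` receives the higher score, as the co-occurrence pattern in the toy devset suggests.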

PERFORMANCE EVALUATION

We perform five-fold cross-validation on the devset to evaluate the performance of our approach and select the best three models to obtain the ranking results. The original soundtrack of the commercial is regarded as the ground truth and is ranked along with the music objects in recset. The accuracy in our work is defined as $1 - (\mathrm{rank}(g) - 1)/(|C| + 1)$, where $\mathrm{rank}(g)$ is the rank of the ground truth and $|C|$ is the number of music objects in recset. We submitted the two CFA models and the one MLSA model with the highest accuracy. Table 1 shows the learning algorithms, parameters, accuracy, and officially rated scores of our three submitted results.
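The accuracy measure can be written as a short function. The recset size of 99 and the ranks used below are hypothetical values chosen so that the arithmetic is easy to follow.

```python
def rank_accuracy(rank_of_truth, recset_size):
    """Accuracy = 1 - (rank(g) - 1) / (|C| + 1), where rank(g) is the
    1-based rank of the ground-truth soundtrack and |C| the recset size."""
    return 1.0 - (rank_of_truth - 1) / (recset_size + 1)

def mean_accuracy(ranks, recset_size):
    """Average accuracy over the test videos of one cross-validation fold."""
    return sum(rank_accuracy(r, recset_size) for r in ranks) / len(ranks)

# Hypothetical example: 99 recset items plus the ground truth.
best_case = rank_accuracy(1, 99)      # ground truth ranked first
mid_case = rank_accuracy(51, 99)      # ground truth ranked 51st
fold_mean = mean_accuracy([1, 51], 99)
```

A ground truth ranked first gives an accuracy of 1.0, and the score decreases linearly as the ground truth moves down the ranking.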

Figure 1: System Architecture of Our Approaches [1].

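Table 1. Performance and Parameters of Submitted Results.

| Algorithm | CFA | CFA | MLSA |
|---|---|---|---|
| No. Clusters (GMM, MFCC) | 10 | 10 | 10 |
| No. Clusters (KL, MFCC) | 10 | 10 | 10 |
| No. Clusters (FP) | 20 | 20 | 10 |
| No. Clusters (BLF) | 30 | 10 | 20 |
| Eigen-number | 200 | 150 | 400 |
| Accuracy | 0.670 | 0.673 | 0.547 |
| First rank average | 2.292 | 2.289 | 2.272 |
| Top-five average | 2.264 | 2.259 | 2.211 |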

REFERENCES

[1] F. F. Kuo, M. K. Shan, S. Y. Lee. Background Music Recommendation for Video Based on Multimodal Latent Semantic Analysis. IEEE Intl. Conf. on Multimedia and Expo, 2013.

[2] D. Li, N. Dimitrova, M. Li, I. K. Sethi. Multimedia Content Processing through Cross-Modal Association. ACM Intl. Conf. on Multimedia, 2003.

[3] C. C. S. Liem, M. Larson, A. Hanjalic. When Music Makes a Scene - Characterizing Music in Multimedia Contexts Via User Scene Descriptions. Intl. Journal of Multimedia Information Retrieval, 2(1), 2013.

[4] T. Pohle, D. Schnitzer, M. Schedl, P. Knees, G. Widmer. On Rhythm and General Music Similarity. Intl. Symp. for Music Information Retrieval, 2009.

[5] J. Urbano, M. Schedl. Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems. Intl. Journal of Multimedia Information Retrieval, 2013.

[6] X. Wang, J. T. Sun, Z. Chen, C. X. Zhai. Latent Semantic Analysis for Multiple-Type Interrelated Data Objects. ACM Intl. Conf. on Information Retrieval, 2013.

[7] Y. Yang, G. Webb. Proportional k-Interval Discretization for Naïve-Bayes Classifiers. European Conf. on Machine Learning, 2001.

[8] R. Zabih, J. Miller, K. Mai. A Feature-based Algorithm for Detecting and Classifying Scene Breaks. ACM Intl. Conf. on Multimedia, 1995.