<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Han</forename><surname>Su</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">University of Washington</orgName>
								<address>
									<settlement>Washington</settlement>
									<country>America</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fang-Fei</forename><surname>Kuo</surname></persName>
							<email>ffkuo@uw.edu</email>
							<affiliation key="aff2">
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chu-Hsiang</forename><surname>Chiu</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">University of Washington</orgName>
								<address>
									<settlement>Washington</settlement>
									<country>America</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yen-Ju</forename><surname>Chou</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">University of Washington</orgName>
								<address>
									<settlement>Washington</settlement>
									<country>America</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Man-Kwan</forename><surname>Shan</surname></persName>
							<email>mkshan@nccu.edu</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Electrical Engineering</orgName>
								<orgName type="institution">University of Washington</orgName>
								<address>
									<settlement>Washington</settlement>
									<country>America</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">National Chengchi University</orgName>
								<address>
									<settlement>Taipei</settlement>
									<country key="TW">Taiwan</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">4AA054D6EE37F471BE60E1296800D2E4</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-19T17:57+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Soundtrack selection</term>
					<term>Multimodal correlation analysis</term>
					<term>Multi-type latent semantic analysis</term>
					<term>Cross-modal factor analysis</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents our approaches to soundtrack selection for commercials based on audio/visual correlation analysis. We adopt two approaches: one based on multi-type latent semantic analysis (MLSA) and the other on cross-modal factor analysis (CFA). We evaluate both systems on the MediaEval Soundtrack Selection for Commercials Dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">MOTIVATION</head><p>Automatic soundtrack selection for videos has received increasing attention. Our approach rests on the latent correlation between video and audio, learned from training data (the Development Dataset). We use two methods for learning the multimodal correlation model. In this paper, we present soundtrack recommendation with each of the two methods and evaluate the systems on the MediaEval corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">SYSTEM ARCHITECTURE</head><p>Figure <ref type="figure" target="#fig_0">1</ref> shows the architecture of the proposed soundtrack selection system, which is based on our previous work <ref type="bibr" target="#b0">[1]</ref>. In the training phase, we first transform the audio/visual feature descriptors provided in the development dataset (devset) into audio/visual words and generate the audio/visual feature matrices. Two algorithms are then employed to learn the content correlation model from the audio/visual feature matrices. For the recommendation dataset (recset), the audio features of each soundtrack are transformed into audio words in the same way as for the devset. In the test phase, given a test video, the visual feature descriptors are transformed into visual words in the same way as those of the devset. The visual words of the test video, along with the audio words of the recset, are fed into the learned content correlation model, which generates the ranking results for soundtrack selection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">AUDIO WORD EXTRACTION</head><p>We use the officially provided audio features, including Beat, Key, MFCC, BLF, and PS09 <ref type="bibr" target="#b3">[4]</ref>, and transform them into audio words by discretization or vector quantization (VQ). For one-dimensional descriptors, such as those of Beat, equal-frequency binning is employed for discretization. The number of bins is set to 19, the square root of the number of items in the devset <ref type="bibr" target="#b6">[7]</ref>. For multidimensional descriptors, clustering-based vector quantization groups the descriptors in the feature space into clusters. For the BLF descriptors, we measure distance with the Manhattan distance and apply average-link and complete-link clustering respectively. For the PS09 descriptors and the FP descriptor of MFCC, we use the Euclidean distance with K-means. For each of the three MFCC descriptors, a Gaussian Mixture Model is used to model the frame-based representation of an audio track. K-L divergence along with the Earth Mover's Distance is then used to measure distance, followed by average-link and complete-link clustering. After vector quantization or discretization, each cluster or bin is regarded as an audio word that represents the descriptors belonging to it. An audio descriptor is encoded into an audio word vector by the index of the cluster or bin to which it belongs. An audio word vector records the presence or absence of each audio word in the soundtrack, and the audio feature vector of a soundtrack is the concatenation of its audio word vectors over all descriptor types.</p></div>
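<div xmlns="http://www.tei-c.org/ns/1.0"><p>The discretization and encoding of a scalar descriptor can be sketched as follows. This is a minimal illustration in plain Python, not the code used for the submitted runs: the function names are hypothetical, and the choice of cut points at integer quantile positions is an assumption, since the text only states that equal-frequency binning is used.</p>
<p><code lang="python">
```python
def equal_frequency_bins(values, n_bins):
    """Return n_bins - 1 cut points so that each bin holds a near-equal share of values."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumed convention: the i-th cut point is the value at position i * n / n_bins.
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]


def to_word_index(value, cuts):
    """Index of the bin (word) that a scalar descriptor value falls into."""
    return sum(1 for c in cuts if value >= c)


def word_vector(index, n_words):
    """Binary presence/absence vector with the single word at `index` switched on."""
    return [1 if i == index else 0 for i in range(n_words)]


def feature_vector(word_vectors):
    """Concatenate the word vectors of all descriptor types into one feature vector."""
    return [bit for vec in word_vectors for bit in vec]
```
</code></p>
<p>For example, splitting the twelve values 1 to 12 into three bins gives the cut points 5 and 9, so a value of 5 is encoded as word index 1 and the word vector [0, 1, 0]. In the setting described above, <code>n_bins</code> would be 19.</p></div>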
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">VISUAL WORD EXTRACTION</head><p>The officially provided visual features are based on MPEG-7. In MPEG, the determination of frame types (I-, P-, and B-frames) depends on the compression algorithm of the MPEG encoder. Since I-frames may not be key-frames, in our work the visual features are extracted at the shot level, where shot boundary detection is based on the edge change fraction computed in the temporal domain <ref type="bibr" target="#b7">[8]</ref>. We then extract 13 types of visual descriptors: color energy, saturation proportion, angular second moment, contrast, correlation, dissimilarity, entropy, homogeneity, GLCM mean, GLCM variance, light median, shadow proportion, and visual excitement <ref type="bibr" target="#b0">[1]</ref>. Since each of the 13 visual descriptors is a scalar, equal-frequency binning is used to generate visual words. Visual word vectors and visual feature vectors are encoded in the same way as audio word vectors and audio feature vectors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">CONTENT CORRELATION MODELING &amp; RECOMMENDATION</head><p>We investigate two approaches for learning the correlation between audio and visual content from the devset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">CFA (Cross-Modal Factor Analysis)</head><p>CFA tries to find the correlation by transforming the audio and visual contents into a common space <ref type="bibr" target="#b1">[2]</ref>. Given an audio feature matrix X and a video feature matrix Y, where each row corresponds to the feature vector of a commercial, CFA finds the orthonormal transformation matrices A and B that minimize $\|XA - YB\|_F^2$, where $\|M\|_F$ is the Frobenius norm of matrix M. Matrices A and B can be obtained by Singular Value Decomposition (SVD) of $X^T Y$ such that $A = U_{xy}$ and $B = V_{xy}$, where $X^T Y = U_{xy} S_{xy} V_{xy}^T$. Matrices A and B encode the correlation information. In our work, given a test video f with visual feature vector $y_f$ and a soundtrack m with audio feature vector $x_m$, the distance d(m, f) between m and f is the Euclidean distance between $x_m A$ and $y_f B$. The nearest five soundtracks in the recset are recommended for each test video.</p></div>
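<div xmlns="http://www.tei-c.org/ns/1.0"><p>The CFA training and ranking steps can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions rather than the submitted implementation: the function names are hypothetical, and truncating A and B to the first k singular vectors is an assumption (Table 1 lists an eigen-number, but the text does not say how it is applied).</p>
<p><code lang="python">
```python
import numpy as np


def fit_cfa(X, Y, k):
    """Learn A (p x k) and B (q x k) from the SVD of X^T Y.

    X: n x p audio feature matrix, Y: n x q visual feature matrix,
    one row per training commercial.
    """
    U, _singular_values, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
    # Assumption: keep only the k leading singular vectors on each side.
    return U[:, :k], Vt.T[:, :k]


def rank_soundtracks(y_f, X_rec, A, B, top_n=5):
    """Indices of the top_n recset soundtracks closest to the test video.

    y_f: visual feature vector of the test video (length q).
    X_rec: audio feature matrix of the recset, one row per soundtrack.
    """
    video = y_f @ B                      # test video in the common space
    tracks = X_rec @ A                   # every soundtrack in the common space
    distances = np.linalg.norm(tracks - video, axis=1)
    return np.argsort(distances)[:top_n]
```
</code></p>
<p>Calling <code>fit_cfa</code> once on the devset matrices and then <code>rank_soundtracks</code> per test video mirrors the training and test phases described above. Smaller distances mean a better match, so the ranking is in ascending order of distance.</p></div>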
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">MLSA (Multi-type Latent Semantic Analysis)</head><p>The other approach we adopt is MLSA, which exploits pairwise co-occurrence correlations among multiple types of entities (descriptors). MLSA represents the entities and correlations by a unified co-occurrence matrix</p><formula xml:id="formula_0">C = \begin{pmatrix} 0 &amp; M_{12} &amp; \cdots &amp; M_{1N} \\ M_{21} &amp; 0 &amp; \cdots &amp; M_{2N} \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ M_{N1} &amp; M_{N2} &amp; \cdots &amp; 0 \end{pmatrix}</formula><p>C is composed of N × N correlation matrices, where N is the total number of descriptor types and $M_{ij}$ is the co-occurrence matrix of descriptor types i and j. C can be decomposed by eigendecomposition. The top k eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k$ and the corresponding eigenvectors $[e_1, e_2, \ldots, e_k]$ span a k-dimensional latent space, which can be represented as a matrix</p><formula xml:id="formula_1">C_k = [\lambda_1 e_1, \lambda_2 e_2, \ldots, \lambda_k e_k].</formula><p>Given a test video f with feature vector $y_f$, we first generate the query vector $y_q$ by concatenating $y_f$ with a zero audio feature vector. To project it onto the latent space, $y_q$ is multiplied by $C_k$. The likelihood of occurrence l(a, f) between an audio descriptor a and the test video f is the cosine similarity between $y_q C_k$ and the row vector of $C_k$ corresponding to the audio descriptor a. The similarity score between a soundtrack m and the test video f is then $r(m, f) = \sum_{\forall a \in m} l(a, f)$. The top five soundtracks in the recset are recommended for each test video.</p></div>
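<div xmlns="http://www.tei-c.org/ns/1.0"><p>The MLSA steps can be sketched with NumPy as below. This is a simplified illustration, not the submitted implementation, and it makes several assumptions that the text does not confirm: co-occurrence is counted as the product of binary feature matrices, eigenvalues are ordered by signed value, the per-word likelihoods are summed into the track score, and all function and parameter names are hypothetical.</p>
<p><code lang="python">
```python
import numpy as np


def build_unified_matrix(F, type_sizes):
    """Unified co-occurrence matrix C with zero blocks on the diagonal.

    F: n x W binary matrix, one row per training commercial, whose columns are
       the words of all descriptor types laid side by side.
    type_sizes: number of words of each descriptor type, in column order.
    """
    C = (F.T @ F).astype(float)          # word-by-word co-occurrence counts
    start = 0
    for size in type_sizes:
        # Words of the same descriptor type are not correlated with each other.
        C[start:start + size, start:start + size] = 0.0
        start += size
    return C


def latent_space(C, k):
    """C_k = [lambda_1 e_1, ..., lambda_k e_k] for the k largest eigenvalues."""
    values, vectors = np.linalg.eigh(C)  # C is symmetric; eigenvalues ascend
    top = np.argsort(values)[::-1][:k]
    return vectors[:, top] * values[top]


def track_score(y_q, track_words, C_k):
    """Sum over the track's audio words of the cosine similarity to the query."""
    query = y_q @ C_k                    # query projected onto the latent space
    rows = C_k[track_words]              # latent rows of the track's audio words
    denom = np.linalg.norm(rows, axis=1) * np.linalg.norm(query)
    cosines = np.divide(rows @ query, denom,
                        out=np.zeros(len(rows)), where=denom > 0)
    return float(cosines.sum())
```
</code></p>
<p>Here <code>y_q</code> is the visual feature vector padded with zeros in the audio word columns, and <code>track_words</code> holds the column indices of the audio words present in one recset soundtrack. Ranking the recset by descending <code>track_score</code> and keeping the first five entries corresponds to the recommendation step.</p></div>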
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">PERFORMANCE EVALUATION</head><p>We perform five-fold cross-validation on the devset to evaluate our approach and select the best three models to produce the ranking results. The original soundtrack of each commercial is regarded as the ground truth and is ranked along with the music objects in the recset. Accuracy is defined as 1 - (rank(g) - 1)/(|C| + 1), where rank(g) is the rank of the ground truth and |C| is the number of music objects in the recset. The two most accurate CFA models and the most accurate MLSA model are submitted. Table <ref type="table" target="#tab_0">1</ref> shows the learning algorithms, parameters, accuracy, and officially rated scores of our three submitted results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: System Architecture of Our Approaches [1].</figDesc><graphic coords="1,333.42,206.79,209.60,136.00" type="bitmap" /></figure>
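<div xmlns="http://www.tei-c.org/ns/1.0"><p>The accuracy measure of Section 6 can be written as a short function. The sketch below follows the stated formula directly; the function names and the averaging over test videos are assumptions added for illustration.</p>
<p><code lang="python">
```python
def rank_accuracy(rank_g, n_recset):
    """Accuracy 1 - (rank(g) - 1) / (|C| + 1) for one test video.

    rank_g: 1-based rank of the ground-truth soundtrack.
    n_recset: |C|, the number of music objects in the recset.
    """
    return 1 - (rank_g - 1) / (n_recset + 1)


def mean_accuracy(ranks, n_recset):
    """Average accuracy over the ground-truth ranks of several test videos."""
    return sum(rank_accuracy(r, n_recset) for r in ranks) / len(ranks)
```
</code></p>
<p>A ground truth ranked first always scores 1, and the score decreases linearly as the ground truth moves down the list of |C| + 1 candidates.</p></div>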
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Performance and Parameters of Submitted Results.</figDesc><table><row><cell>Algorithm</cell><cell>CFA</cell><cell>CFA</cell><cell>MLSA</cell></row><row><cell>No. Clusters (GMM, MFCC)</cell><cell>10</cell><cell>10</cell><cell>10</cell></row><row><cell>No. Clusters (KL, MFCC)</cell><cell>10</cell><cell>10</cell><cell>10</cell></row><row><cell>No. Clusters (FP)</cell><cell>20</cell><cell>20</cell><cell>10</cell></row><row><cell>No. Clusters (BLF)</cell><cell>30</cell><cell>10</cell><cell>20</cell></row><row><cell>Eigen-number</cell><cell>200</cell><cell>150</cell><cell>400</cell></row><row><cell>Accuracy</cell><cell>0.670</cell><cell>0.673</cell><cell>0.547</cell></row><row><cell>First rank average</cell><cell>2.292</cell><cell>2.289</cell><cell>2.272</cell></row><row><cell>Top-five average</cell><cell>2.264</cell><cell>2.259</cell><cell>2.211</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Background Music Recommendation for Video Based on Multimodal Latent Semantic Analysis</title>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">F</forename><surname>Kuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Shan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">Y</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE Intl. Conf. on Multimedia and Expo</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Multimedia Content Processing through Cross-Modal Association</title>
		<author>
			<persName><forename type="first">D</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Dimitrova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">K</forename><surname>Sethi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Intl. Conf. on Multimedia</title>
				<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">When Music Makes a Scene -Characterizing Music in Multimedia Contexts Via User Scene Descriptions</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">C S</forename><surname>Liem</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Larson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hanjalic</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intl. Journal of Multimedia Information Retrieval</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">On Rhythm and General Music Similarity</title>
		<author>
			<persName><forename type="first">T</forename><surname>Pohle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schmitzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Knees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Widmer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intl. Symp. for Music Information Retrieval</title>
				<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Urbano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schedl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Intl. Journal of Multimedia Information Retrieval</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Latent Semantic Analysis for Multiple-Type Interrelated Data Objects</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">X</forename><surname>Zhai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Intl. Conf. on Information Retrieval</title>
				<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Proportional k-Interval Discretization for Naïve-Bayes Classifiers</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Webb</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conf. on Machine Learning</title>
				<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A Feature-based Algorithm for Detecting and Classifying Scene Breaks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zabih</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACM Intl. Conf. on Multimedia</title>
				<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
