<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MediaEval 2013: Soundtrack Selection for Commercials Based on Content Correlation Modeling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Han Su</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fang-Fei Kuo</string-name>
          <email>ffkuo@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chu-Hsiang Chiu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yen-Ju Chou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Man-Kwan Shan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, National Chengchi University</institution>
          ,
          <addr-line>Taipei</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering, University of Washington</institution>
          ,
          <addr-line>Seattle, Washington</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This paper presents our approaches to soundtrack selection for commercials based on audio/visual correlation analysis. Two approaches are adopted: one is based on multimodal latent semantic analysis (MLSA) and the other on cross-modal factor analysis (CFA). The evaluation on the MediaEval Soundtrack Selection for Commercials dataset shows the performance of our systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Soundtrack selection</kwd>
        <kwd>Multimodal correlation analysis</kwd>
        <kwd>Multi-type latent semantic analysis</kwd>
        <kwd>Cross-modal factor analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2. SYSTEM ARCHITECTURE</title>
      <p>
        Figure 1 shows the architecture of the proposed soundtrack
selection system, which is based on our previous work [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In the training phase, we first transform the descriptors
of the audio/visual features provided in the development dataset
(devset) into audio/visual words and generate the audio/visual
feature matrices. Two algorithms are then employed to learn the
content correlation model from the visual/audio feature matrices.
For the recommendation dataset (recset), the audio features of each
soundtrack are transformed into audio words in the same way as for
the devset. In the test phase, given a test video, the descriptors
of its visual features are transformed into visual words in the
same way as those of the devset. The transformed visual words of
the test video, along with the audio words of the recset, are fed
into the learned content correlation model, and the ranking results
for soundtrack selection are generated.
      </p>
    </sec>
    <sec id="sec-2">
      <title>3. AUDIO WORD EXTRACTION</title>
      <p>
        We use the officially provided audio features, including
Beat, Key, MFCC, BLF, and PS09 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and transform them into audio words by discretization or
vector quantization (VQ). For one-dimensional descriptors, such as
the descriptors of Beat, equal-frequency binning is employed for
discretization. The number of bins is set to 19, the square root of
the number of commercials in the devset [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. For multidimensional descriptors, clustering-based vector
quantization is performed to group descriptors in the feature space
into clusters. For the descriptors of BLF, we measure distance with
the Manhattan distance and apply average-link and complete-link
clustering, respectively. For the descriptors of PS09 and the FP
descriptor of MFCC, we use the Euclidean distance along with
K-means. For each of the three descriptors of MFCC, a Gaussian
Mixture Model is utilized to model the frame-based representation
of an audio track; K-L divergence along with the Earth Mover's
Distance is then used to measure distance, followed by average-link
and complete-link clustering. After vector
quantization/discretization, each cluster/bin may be regarded as an
audio word that represents the descriptors belonging to that
cluster/bin. An audio descriptor is encoded into an audio word
vector by the index of the cluster/bin to which it belongs. An
audio word vector contains the presence or absence information of
each audio word in the soundtrack, while the audio feature vector
of a soundtrack is formed by concatenating the audio word vectors
of all descriptor types.
      </p>
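      <p>To make the word-extraction step concrete, the following is
a minimal Python sketch (illustrative, not the authors' released
code) of the two quantization routes described above:
equal-frequency binning for a scalar descriptor such as Beat, and
Euclidean K-means VQ for a multidimensional descriptor such as PS09.
The devset size, the toy data, the cluster count of 19, and the use
of scikit-learn are assumptions for illustration.</p>
      <preformat>
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_tracks = 360                              # assumed devset size (19 ~ sqrt(360))
beat = rng.gamma(2.0, 60.0, size=n_tracks)  # toy scalar descriptor (e.g. Beat)
ps09 = rng.normal(size=(n_tracks, 32))      # toy multidimensional descriptor

# Scalar route: equal-frequency binning with 19 bins, as in the text.
n_bins = 19
edges = np.quantile(beat, np.linspace(0, 1, n_bins + 1)[1:-1])
beat_word = np.digitize(beat, edges)        # bin index = audio word

# Multidimensional route: Euclidean K-means VQ; each cluster is an audio word.
km = KMeans(n_clusters=n_bins, n_init=10, random_state=0).fit(ps09)
ps09_word = km.labels_

def presence(words, vocab):
    """0/1 vector marking which audio word each soundtrack falls into."""
    v = np.zeros((len(words), vocab))
    v[np.arange(len(words)), words] = 1.0
    return v

# Audio feature vector = concatenation of the per-descriptor word vectors.
X = np.hstack([presence(beat_word, n_bins), presence(ps09_word, n_bins)])
      </preformat>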
    </sec>
    <sec id="sec-3">
      <title>4. VISUAL WORD EXTRACTION</title>
      <p>
        The officially provided visual features are based on MPEG-7.
In MPEG, the determination of frame types (I-, P-, and B-frames)
depends on the compression algorithm of the MPEG encoder. Because
I-frames may not be key-frames, in our work the visual features are
extracted at the shot level, where shot boundary detection is based
on calculating the edge change fraction in the temporal domain [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We then extract 13 types of visual descriptors: color
energy, saturation proportion, angular second moment, contrast,
correlation, dissimilarity, entropy, homogeneity, GLCM mean, GLCM
variance, light median, shadow proportion, and visual excitement [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since each of the 13 visual descriptors is a scalar,
equal-frequency binning is performed to generate the visual words.
Visual word vectors and visual feature vectors are encoded in the
same way as audio word vectors and audio feature vectors.
      </p>
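      <p>As an illustration of the shot-level extraction, the sketch
below computes a GLCM-based subset of the 13 descriptors (angular
second moment, contrast, correlation, dissimilarity, homogeneity)
and bins each scalar into visual words. The use of scikit-image,
the random stand-in keyframes, and the GLCM parameters are our
assumptions; the paper does not specify an implementation.</p>
      <preformat>
import numpy as np
from skimage.feature import graycomatrix, graycoprops

rng = np.random.default_rng(0)
# Stand-in grayscale keyframes, one per detected shot.
keyframes = rng.integers(0, 256, size=(50, 64, 64), dtype=np.uint8)

# GLCM-based subset of the 13 scalar descriptors listed above.
props = ["ASM", "contrast", "correlation", "dissimilarity", "homogeneity"]
desc = np.array([
    [graycoprops(graycomatrix(f, [1], [0], levels=256, symmetric=True,
                              normed=True), p)[0, 0] for p in props]
    for f in keyframes
])

def equal_freq_bin(values, n_bins=19):
    """Map each scalar descriptor value to an equal-frequency bin index."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(values, edges)

# One visual word per descriptor type and shot; rows index shots.
visual_words = np.column_stack(
    [equal_freq_bin(desc[:, j]) for j in range(len(props))])
      </preformat>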
    </sec>
    <sec id="sec-4">
      <title>5. CONTENT CORRELATION MODELING &amp; RECOMMENDATION</title>
      <p>We investigate two approaches for learning the correlation
between audio and visual contents from the devset.</p>
    </sec>
    <sec id="sec-5">
      <title>5.1 CFA (Cross-Modal Factor Analysis)</title>
      <p>
        CFA finds the correlation by transforming the audio and
visual contents into a common space [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Given an audio feature matrix X and a visual feature
matrix Y, where each row corresponds to the feature vector of a
commercial, CFA finds the orthonormal transformation matrices A and
B that minimize ||XA - YB||^2, where ||M|| denotes the Frobenius
norm of a matrix M. A and B can be obtained by Singular Value
Decomposition (SVD) of X^T Y: with X^T Y = Uxy Sxy Vxy^T, we set
A = Uxy and B = Vxy. The matrices A and B encode the correlation
information. In our work, given a test video f with visual feature
vector yf and a soundtrack m with audio feature vector xm, the
distance d(m, f) between m and f is the Euclidean distance between
xmA and yfB. The five nearest soundtracks in the recset are
recommended for each test video.
      </p>
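      <p>A minimal numpy sketch of this procedure follows; the
matrices are random stand-ins for the real audio/visual feature
matrices, and the sizes are illustrative.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 50))            # devset audio feature matrix
Y = rng.random((200, 40))            # devset visual feature matrix

# SVD of X^T Y yields the orthonormal transforms: A = U_xy, B = V_xy.
U, s, Vt = np.linalg.svd(X.T @ Y, full_matrices=False)
A, B = U, Vt.T

recset_X = rng.random((500, 50))     # audio feature vectors of the recset
y_f = rng.random(40)                 # visual feature vector of a test video

# d(m, f) = Euclidean distance between x_m A and y_f B in the common space.
d = np.linalg.norm(recset_X @ A - y_f @ B, axis=1)
top5 = np.argsort(d)[:5]             # five nearest soundtracks to recommend
      </preformat>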
    </sec>
    <sec id="sec-6">
      <title>5.2 MLSA (Multi-type Latent Semantic</title>
    </sec>
    <sec id="sec-7">
      <title>Analysis)</title>
      <p>The other approach we adopted is MLSA, which exploits the
pairwise co-occurrence correlations among multiple types of entities
(descriptors). MLSA represents the entities and their correlations
by a unified co-occurrence matrix C = [0, C12, ..., C1N; C21, 0,
..., C2N; ...; CN1, CN2, ..., 0], composed of N × N correlation
matrices, where N is the total number of descriptor types and Cij
is the co-occurrence matrix of descriptor types i and j (the
diagonal blocks are zero). C can be decomposed by eigen
decomposition. The top k eigenvalues λ1 ≥ λ2 ≥ ... ≥ λk and the
corresponding eigenvectors [e1, e2, ..., ek] span a k-dimensional
latent space, which can be represented as the matrix
Ck = [λ1·e1, λ2·e2, ..., λk·ek]. Given a test video f with feature
vector yf, we first generate the query vector yq by concatenating
yf with a zero audio feature vector. To project onto the latent
space, yq is multiplied by Ck. The likelihood of occurrence l(a, f)
between an audio descriptor a and the test video f is the cosine
similarity between yqCk and the row vector of Ck corresponding to
the audio descriptor a. The similarity score between a soundtrack m
and the test video f is then s(m, f) = Σ a∈m l(a, f), summing over
all audio descriptors a present in m.</p>
      <p>The top five soundtracks in recset are recommended for
each test video.</p>
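      <p>The ranking step can be sketched in numpy as follows; the
co-occurrence matrix here is a random symmetric stand-in, and the
word counts and block sizes are illustrative assumptions.</p>
      <preformat>
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_aud = 60, 40                    # toy visual/audio word counts
n = n_vis + n_aud
M = rng.random((n, n))
C = (M + M.T) / 2                        # symmetric stand-in for C
np.fill_diagonal(C, 0.0)                 # zero diagonal, as above

# Top-k eigenpairs of the symmetric C (eigh returns ascending eigenvalues).
k = 10
w, V = np.linalg.eigh(C)
lam, E = w[::-1][:k], V[:, ::-1][:, :k]
Ck = E * lam                             # C_k = [lambda_1*e_1, ..., lambda_k*e_k]

# Query: visual word vector of the test video, zero-padded on the audio side.
y_f = rng.integers(0, 2, n_vis).astype(float)
y_q = np.concatenate([y_f, np.zeros(n_aud)])
q = y_q @ Ck                             # projection onto the latent space

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

# l(a, f) for every audio descriptor a (audio rows of C_k follow the visual block).
l_af = np.array([cos(q, Ck[n_vis + a]) for a in range(n_aud)])

# s(m, f): sum of l(a, f) over the audio words present in soundtrack m.
m_words = rng.integers(0, 2, n_aud).astype(bool)   # toy soundtrack m
score = l_af[m_words].sum()
      </preformat>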
    </sec>
    <sec id="sec-8">
      <title>6. PERFORMANCE EVALUATION</title>
      <p>We perform five-fold cross-validation on the devset to
evaluate the performance of our approaches and select the best
three models to obtain the ranking results. The original soundtrack
of each commercial is regarded as the ground truth and is ranked
along with the music objects in the recset. The accuracy in our
work is defined as 1 - (rank(g) - 1)/(|C| + 1), where rank(g) is
the rank of the ground truth and |C| is the number of music objects
in the recset. The results of the two CFA models with the highest
accuracy and the MLSA model with the highest accuracy were
submitted. Table 1 shows the adopted learning algorithms,
parameters, accuracy, and the officially rated score for our three
submitted runs.</p>
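      <p>For clarity, the accuracy measure written as a small
function, evaluated on hypothetical ranks and candidate counts:</p>
      <preformat>
def accuracy(rank_g: int, num_candidates: int) -> float:
    """1 - (rank(g) - 1) / (|C| + 1); 1.0 when the ground truth ranks first."""
    return 1.0 - (rank_g - 1) / (num_candidates + 1)

print(accuracy(1, 500))     # 1.0   (ground truth ranked first)
print(accuracy(251, 500))   # ~0.50 (ground truth mid-list)
      </preformat>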
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Shan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Background Music Recommendation for Video Based on Multimodal Latent Semantic Analysis</article-title>
          ,
          <source>IEEE Intl. Conf. on Multimedia and Expo</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dimitrova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. K.</given-names>
            <surname>Sethi</surname>
          </string-name>
          ,
          <article-title>Multimedia Content Processing through Cross-Modal Association</article-title>
          ,
          <source>ACM Intl. Conf. on Multimedia</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C. C. S.</given-names>
            <surname>Liem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanjalic</surname>
          </string-name>
          ,
          <article-title>When Music Makes a Scene - Characterizing Music in Multimedia Contexts Via User Scene Descriptions</article-title>
          ,
          <source>Intl. Journal of Multimedia Information Retrieval</source>
          , Vol.
          <volume>2</volume>
          , Issue
          <issue>1</issue>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pohle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schnitzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knees</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Widmer</surname>
          </string-name>
          ,
          <article-title>On Rhythm and General Music Similarity</article-title>
          ,
          <source>Intl. Symp. for Music Information Retrieval</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Urbano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schedl</surname>
          </string-name>
          ,
          <article-title>Minimal Test Collections for Low-Cost Evaluation of Audio Music Similarity and Retrieval Systems</article-title>
          ,
          <source>Intl. Journal of Multimedia Information Retrieval</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <article-title>Latent Semantic Analysis for Multiple-Type Interrelated Data Objects</article-title>
          ,
          <source>ACM Intl. Conf. on Information Retrieval</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <article-title>Proportional k-Interval Discretization for Naïve-Bayes Classifiers</article-title>
          ,
          <source>European Conf. on Machine Learning</source>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zabih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Miller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Mai</surname>
          </string-name>
          ,
          <article-title>A Feature-based Algorithm for Detecting and Classifying Scene Breaks</article-title>
          ,
          <source>ACM Intl. Conf. on Multimedia</source>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>