<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequential pattern mining on multimedia data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Corentin Hardy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Amsaleg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Malinowski</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>René Quiniou</string-name>
        </contrib>
        <aff>IRISA/Inria Rennes, France</aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Analyzing multimedia data is a challenging problem due to the quantity and complexity of such data. Mining for frequently recurring patterns is a task often run to help discover the underlying structure hidden in the data. In this article, we propose audio data symbolization and sequential pattern mining methods to extract patterns from audio streams. Experiments show that this task is hard and that symbolization is a critical step for extracting relevant audio patterns.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Motif discovery relies either on raw time series processing or on mining a symbolic
version [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,4,5</xref>
        ]. In the first kind of approach, algorithms are mostly built on
the DTW distance, which can deal with the temporal distortions that often occur
in audio signals [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Muscariello et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have proposed an extended version of
DTW for finding the best occurrence of a seed in a longer subsequence. This
kind of approach is efficient in terms of accuracy, as the signal is completely
exploited, but the computational cost of the DTW distance prevents its use on
very large databases.
      </p>
      <p>
        Other approaches working with a symbolized version of the audio signal
mostly use algorithms from bioinformatics to extract motifs. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the MEME
algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used to estimate a statistical model for each discovered motif.
In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the SNAP algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is used to search for near-duplicate video
sequences by query.
      </p>
      <p>Some algorithms coming from bioinformatics are very efficient, but have been
optimized to work with alphabets of very small size (from 4 to 20 symbols). In this paper,
we consider the use of sequential pattern mining algorithms for discovering motifs
in audio data.</p>
    </sec>
    <sec id="sec-3">
      <title>Pattern mining on audio data</title>
      <p>In this section, we explain how we used sequential pattern mining algorithms
to discover repeating patterns in audio data. As pattern mining algorithms deal
with symbolic sequences, we first present how to transform the time series derived from
audio data into symbolic sequences. Then we show how to use sequential pattern
algorithms on these symbolic sequences.</p>
      <p>
        MFCC (Mel-frequency cepstral coefficients) is a popular representation of
audio signals. First, MFCC coefficients are extracted from the raw audio
signal (with a sliding window), yielding a 13-dimensional time series. Then, this
multivariate time series is transformed into a sequence of symbols. Many
methods have been proposed for transforming time series into symbol sequences.
Here, we have chosen the method proposed by Wang et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We have
also tried the very popular SAX approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. SAX symbols convey very little
information about the original signal (only the average value over a window). This
symbolization technique is less adapted to our problem and produced worse
results.
      </p>
      <p>To this end, each dimension of the 13-dimensional time series is divided into
consecutive non-overlapping windows of a fixed length. The 13 sub-series related to the
same window are then concatenated (respecting the order of the MFCC dimensions).
The resulting vectors (of size 13 times the window length) are then clustered by a k-means algorithm
to build a codebook, each word in the codebook corresponding to a cluster.
Finally, the original multivariate time series is coded into a sequence of symbols
by assigning to each window the symbol in the codebook corresponding to the
closest cluster centroid. This symbolization process is sketched in Figures 1a
and 1b.</p>
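      <p>As an illustrative sketch (not the authors' code), the window-and-cluster symbolization described above can be written as follows; the window length, codebook size, and the use of scikit-learn's k-means are assumptions for the example:</p>

```python
import numpy as np
from sklearn.cluster import KMeans

def symbolize(series, win_len, codebook_size, seed=0):
    """Turn a (T, 13) MFCC time series into a sequence of symbols.

    Each non-overlapping window of win_len frames is flattened into a
    vector of size 13 * win_len (the 13 sub-series concatenated in MFCC
    order); k-means over these vectors builds the codebook, and each
    window is labeled by the index of its nearest centroid.
    """
    n_windows = series.shape[0] // win_len
    windows = series[:n_windows * win_len].reshape(n_windows, win_len, 13)
    # Concatenate the 13 sub-series of each window, dimension by dimension.
    vectors = windows.transpose(0, 2, 1).reshape(n_windows, 13 * win_len)
    km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    return km.fit_predict(vectors)  # one symbol (cluster index) per window

# Toy example: 100 frames of synthetic 13-dimensional "MFCC" data.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))
symbols = symbolize(mfcc, win_len=5, codebook_size=6)  # 20 symbols in {0..5}
```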
      <p>The representation above may be too imprecise, as it mixes coefficients of
very different orders of magnitude. To cope with this problem, we propose to divide the 13
dimensions into 2 or more sub-bands of consecutive dimensions that represent
more closely related dimensions. The same transformation described above
operates on each sub-band and yields one codebook per sub-band. There are thus as
many symbolic sequences as there are sub-bands. Finally, the sub-band symbols
related to the same window are grouped into itemsets, as illustrated in Figure 1c.</p>
      <p>[Figure 1: (a) K-means clustering is performed on the set of windows to build a codebook (of size 6, here). (b) Every window is labeled by the symbol associated with the closest cluster centroid. (c) Conversion of a 2 sub-band time series into a sequence of itemsets using 2 codebooks of size 5.]</p>
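      <p>Grouping the per-sub-band symbols into itemsets is then a simple alignment of the sub-band sequences by window index; the two symbol sequences below are hypothetical:</p>

```python
# Hypothetical symbol sequences from two sub-band codebooks
# (one symbol per window, same windows in both sequences).
band1 = ['a', 'b', 'a', 'c']
band2 = ['x', 'y', 'x', 'x']

# One itemset per window: the symbols of all sub-bands at that window.
itemsets = [frozenset(pair) for pair in zip(band1, band2)]
```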
      <p>
        Once the raw signal is transformed into a symbolic sequence of items or
itemsets, classical sequential motif discovery algorithms can be applied. Two
kinds of sequential pattern discovery algorithms have been proposed: algorithms
that process sequences of items and algorithms that process sequences of itemsets
(an itemset is a set of items that occur in a short time period). We have chosen
to evaluate one algorithm of each kind in this paper: MaxMotif [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and
CMP-Miner [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which process sequences of items and sequences of itemsets, respectively.
      </p>
      <p>Note that, in the classical setting of sequential pattern mining, a pattern
occurrence may skip symbols in the sequence. For instance, acccb is an occurrence
of pattern ab in sequence dacccbe. Generally, algorithms provide means to put
constraints on extracted motifs, such as minimum and maximum motif length
and the allowed gaps; gaps are symbols that can be skipped when looking for a
pattern occurrence. In our application, it is crucial to allow gaps in motifs, since
temporal distortions often occur in audio signals.</p>
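      <p>As a simplified sketch of gap-constrained matching (not the miners' actual code), the following function finds the start positions where an item pattern occurs in a sequence with at most max_gap skipped symbols between consecutive pattern items:</p>

```python
def occurs_with_gaps(pattern, sequence, max_gap):
    """Return the 1-based start positions where pattern occurs in
    sequence, allowing up to max_gap skipped symbols between
    consecutive pattern items (greedy leftmost matching)."""
    starts = []
    for start in range(len(sequence)):
        if sequence[start] != pattern[0]:
            continue
        pos, ok = start, True
        for item in pattern[1:]:
            # Look for the next pattern item within the allowed gap window.
            window = sequence[pos + 1 : pos + 2 + max_gap]
            if item in window:
                pos = pos + 1 + window.index(item)
            else:
                ok = False
                break
        if ok:
            starts.append(start + 1)  # 1-based positions, as in the text
    return starts

# 'acccb' is an occurrence of pattern 'ab' in 'dacccbe' once 3 gaps are allowed:
print(occurs_with_gaps('ab', 'dacccbe', max_gap=3))  # [2]
```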
      <p>MaxMotif enumerates all frequent (with respect to a given minimal support)
closed patterns in a database of item sequences. MaxMotif allows gaps in the
temporal domain, represented by a wildcard symbol. For instance, pattern
(f a), with one gap allowed between f and a, occurs in sequence (efcaefbaab) at positions 2 and 6.</p>
      <p>
        CMP-Miner extracts all frequent closed patterns in a database of itemset
sequences. It uses the PrefixSpan projection principle [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and the BIDE
bidirectional checking [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. CMP-Miner allows gaps both in the temporal domain
and inside an itemset. For instance, pattern ((b,f) (c,g,j)) occurs in sequence
((e,i) (b,f) (a,f) (c,g) (e,j) (b,f) (d,h) (c,g) (c,j) (a,h)) at positions 2 and 6,
gaps inside an itemset allowing the pattern itemset (c,g,j) to match the sequence
itemset (c,g).
      </p>
      <p>The parameters of the two methods are described in Table 1.</p>
      <p>We present in this section some results from two experiments, one on a synthetic
dataset and the other on a real dataset.
In this first experiment, we have created a dataset composed of 30 audio
signals corresponding to 10 utterances of each of the 3 words affaires, mondiale and
cinquante, pronounced by several French speakers. Our goal is to evaluate the
impact of the codebook size on the extracted motifs. The two algorithms
presented above have been applied to this dataset with the following parameters:
= 5, minSupport = 4, maxGap = 1, minLength = 4, maxLength = 20. For
CMP-Miner we set = 3 and minItem = 2. These parameter settings were
chosen after extensive tests over possible value ranges.</p>
      <p>First, sequential patterns are extracted. Then, we associate with each pattern
the word in whose utterances the pattern most often occurs. For each
extracted pattern, a precision/recall score is computed. Figures 2a and 2b depict
the precision/recall score versus the codebook size for MaxMotif and
CMP-Miner. As can be seen, MaxMotif obtains the best efficiency. This figure also
shows that when the codebook size increases, the precision improves slightly but
not the recall.</p>
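      <p>The per-pattern scoring can be sketched as follows; the function and data layout are illustrative assumptions (precision: fraction of the pattern's occurrences falling in utterances of its majority word; recall: fraction of that word's utterances covered):</p>

```python
from collections import Counter

def score_pattern(occ_words, utterances_per_word):
    """Precision/recall of one extracted pattern.

    occ_words: word label of each utterance the pattern occurs in
    (at most one entry per utterance).
    utterances_per_word: total utterance count for each word.
    """
    counts = Counter(occ_words)
    word, hits = counts.most_common(1)[0]  # associate the majority word
    precision = hits / len(occ_words)
    recall = hits / utterances_per_word[word]
    return word, precision, recall

# Hypothetical pattern occurring in 5 utterances, 4 of them 'cinquante'
# (10 utterances per word, as in the synthetic dataset).
word, p, r = score_pattern(
    ['cinquante'] * 4 + ['mondiale'],
    {'affaires': 10, 'mondiale': 10, 'cinquante': 10},
)
# word == 'cinquante', p == 0.8, r == 0.4
```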
      <p>Figure 2c shows the pattern length distribution for different codebook sizes
for MaxMotif. For small codebooks, many long patterns are extracted.
However, they are not very accurate because, being general, they can occur in many
different sequences. For big codebooks, many pattern candidates can be found,
reflecting sequence variability. However, many candidates have a low support,
often under the minimal threshold, and so fewer patterns are extracted.</p>
      <p>
        The symbolization step is crucial. Figure 2d shows five symbolic
representations of the word cinquante for a codebook of size 15. These strings highlight
the two kinds of variability (spectral and temporal) that make the task hard for
mining algorithms in this example. The same experiment was performed using
the SAX symbolization method [
the SAX symbolization method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on each dimension of the multidimensional
time series. This representation turned out to be less accurate. Indeed, the results
obtained by CMP-Miner using the SAX representation were worse. There is no
space to detail these results here.
      </p>
      <p>Experiment on a larger database.
Now, we consider a dataset containing 7 hours of audio content. The dataset is
divided into 21 audio tracks coming from various radio stations. This experiment
is closer to a real setting.</p>
      <p>Only MaxMotif has been tested on this dataset. The parameters were: = 4,
= 80, minSupport = 40, maxGap = 1, minLength = 5, maxLength = 20.
The codebook size is greater than in the previous experiment, to deal with more
different sounds. Pattern extraction is very fast: less than 4 minutes for more
than one million patterns. Some of them are interesting and correspond, for
instance, to crowd noises, jingles and music patterns, or short silences. However,
similarly to the experiment on the synthetic dataset, only very few patterns
corresponding to repeated words could be extracted.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have presented preliminary work investigating how to use
sequential pattern mining algorithms on audio data. The aim of this work was
to evaluate whether these algorithms could be relevant for this problem. The
experiments pointed out the difficulty of mining audio signals, because of temporal
and spectral distortion. The same words pronounced in different contexts and by
different speakers can sound very different and yield very different patterns. The
results are promising, but both the symbolization and the motif extraction should be
improved. For instance, to account for spectral variability, considering distances
between symbols should improve the overall performance of pattern extraction.</p>
      <p>[Figure 2: (a) Precision/Recall curves for MaxMotif. (b) Precision/Recall curves for CMP-Miner with 3 sub-bands. (c) Pattern size distribution for different codebook sizes. (d) Example of representation for a codebook of size 15.]</p>
      <p>We have also noticed that the dimensions of the MFCC time series are
not all equally important for the discovery. Selecting or weighting the dimensions of the
multidimensional time series could also improve the performance.</p>
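      <p>A symbol-to-symbol distance of the kind suggested above could, for instance, reuse the codebook centroids; this is a sketch of our own, not something evaluated in the paper:</p>

```python
import numpy as np

def symbol_distance(centroids, s1, s2):
    """Distance between two symbols, taken as the Euclidean distance
    between their k-means codebook centroids."""
    return float(np.linalg.norm(centroids[s1] - centroids[s2]))

# Toy codebook of 3 centroids in a 2-dimensional feature space.
centroids = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
d01 = symbol_distance(centroids, 0, 1)  # 5.0 (the 3-4-5 triangle)
```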
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueen</surname>
          </string-name>
          ,
          <article-title>Enumeration of time series motifs of all lengths</article-title>
          ,
          <source>in Data Mining (ICDM), 2013 IEEE 13th International Conference on</source>
          , pp.
          <fpage>547</fpage>
          <lpage>556</lpage>
          ,
          <year>Dec. 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Keogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lonardi</surname>
          </string-name>
          ,
          <article-title>Experiencing SAX: A novel symbolic representation of time series</article-title>
          ,
          <source>Data Min. Knowl. Discov.</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>107</fpage>
          <lpage>144</lpage>
          , Oct.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Herley</surname>
          </string-name>
          ,
          <article-title>ARGOS: Automatically extracting repeating objects from multimedia streams</article-title>
          ,
          <source>IEEE Trans. on Multimedia</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>115</fpage>
          <lpage>129</lpage>
          , Feb.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Esling</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Agon</surname>
          </string-name>
          ,
          <article-title>Time-series data mining</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>45</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>12</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Mooney</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Roddick</surname>
          </string-name>
          ,
          <article-title>Sequential pattern mining approaches and algorithms</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          , vol.
          <volume>45</volume>
          ,
          Mar.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Unsupervised pattern discovery in speech</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>16</volume>
          , pp.
          <fpage>186</fpage>
          <lpage>197</lpage>
          , Jan.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscariello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          ,
          <article-title>Audio keyword extraction by unsupervised word discovery</article-title>
          ,
          <source>in INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association</source>
          , (Brighton, United Kingdom),
          Sept.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Burred</surname>
          </string-name>
          ,
          <article-title>Genetic motif discovery applied to audio analysis</article-title>
          ,
          <source>in International Conference on Acoustics, Speech and Signal Processing</source>
          , pp.
          <fpage>361</fpage>
          <lpage>364</lpage>
          , IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Bailey</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning of multiple motifs in biopolymers using expectation maximization</article-title>
          ,
          <source>Machine Learning</source>
          , vol.
          <volume>21</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>51</fpage>
          <lpage>80</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>L. S. d.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. K.</given-names>
            <surname>do Patrocinio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J. F.</given-names>
            <surname>Guimarães</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <article-title>Searching for near-duplicate video sequences from a scalable sequence aligner</article-title>
          ,
          <source>in International Symposium on Multimedia</source>
          , pp.
          <fpage>223</fpage>
          <lpage>226</lpage>
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Bolosky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Curtis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Karp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sittler</surname>
          </string-name>
          ,
          <article-title>Faster and more accurate sequence alignment with SNAP</article-title>
          ,
          <source>arXiv preprint arXiv:1111.5572</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Megalooikonomou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <article-title>Time series analysis with multiple resolutions</article-title>
          ,
          <source>Information Systems</source>
          , vol.
          <volume>35</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>56</fpage>
          <lpage>74</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>H.</given-names>
            <surname>Arimura</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          ,
          <article-title>An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence</article-title>
          ,
          <source>Journal of Combinatorial Optimization</source>
          , vol.
          <volume>13</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.-T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mining closed patterns in multi-sequence time-series databases</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          , vol.
          <volume>68</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1071</fpage>
          <lpage>1090</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mortazavi-Asl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Dayal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth</article-title>
          ,
          <source>in International Conference on Data Engineering</source>
          , p.
          <fpage>215</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>BIDE: Efficient mining of frequent closed sequences</article-title>
          ,
          <source>in Data Engineering, Proceedings. 20th International Conference on Data Engineering</source>
          , pp.
          <fpage>79</fpage>
          <lpage>90</lpage>
          ,
          <year>March 2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>