<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sequential pattern mining on multimedia data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Corentin Hardy</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Amsaleg</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Gravier</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Malinowski</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>René Quiniou</string-name>
        </contrib>
        <aff>IRISA/Inria Rennes, France</aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Analyzing multimedia data is a challenging problem due to the quantity and complexity of such data. Mining for frequently recurring patterns is a task often run to help discover the underlying structure hidden in the data. In this article, we propose audio data symbolization and sequential pattern mining methods to extract patterns from audio streams. Experiments show that this task is hard and that symbolization is a critical step for extracting relevant audio patterns.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Related work</title>
      <p>
        Motif discovery relies either on raw time series processing or on mining a symbolic
version [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,4,5</xref>
        ]. In the first kind of approach, algorithms are mostly built on
the DTW distance, which can deal with the temporal distortions that often occur
in audio signals [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Muscariello et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have proposed an extended version of
DTW for finding the best occurrence of a seed in a longer subsequence. This
kind of approach is efficient in terms of accuracy, as the signal is completely
exploited, but the computational cost of the DTW distance prevents its use on
very large databases.
      </p>
      <p>
        Other approaches working with a symbolized version of the audio signal
mostly use algorithms from bioinformatics to extract motifs. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], the MEME
algorithm [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is used to estimate a statistical model for each discovered motif.
In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the SNAP algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is used to search for near-duplicate video
sequences by query.
      </p>
      <p>Some algorithms coming from bioinformatics are very efficient, but have been
optimized to work with alphabets of very small size (from 4 to 20 symbols). In this paper,
we consider the use of sequential pattern mining algorithms for discovering motifs
in audio data.</p>
    </sec>
    <sec id="sec-3">
      <title>Pattern mining on audio data</title>
      <p>In this section, we explain how we used sequential pattern mining algorithms
to discover repeating patterns in audio data. As pattern mining algorithms deal
with symbolic sequences, we first present how to transform the time series derived from
audio data into symbolic sequences. Then we show how to use sequential pattern
algorithms on these symbolic sequences.</p>
      <p>
        MFCC (Mel-frequency cepstral coefficients) is a popular representation of
audio signals. First, MFCC coefficients are extracted from the raw audio
signal (with a sliding window), yielding a 13-dimensional time series. Then, this
multivariate time series is transformed into a sequence of symbols. Many
methods have been proposed for transforming time series into symbol sequences.
Here, we have chosen the method proposed by Wang et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We have
also tried the very popular SAX approach [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. SAX symbols convey very little
information about the original signal (only the average value over a window). This
symbolization technique is less adapted to our problem and produced worse
results.
      </p>
      <p>To this end, each dimension of the 13-dimensional time series is divided into
consecutive non-overlapping windows of a fixed length. The 13 sub-series related to the
same window are then concatenated (respecting the order of the MFCC dimensions).
The resulting vectors (of size 13 times the window length) are then clustered by a k-means algorithm
to build a codebook, each word in the codebook corresponding to a cluster.
Finally, the original multivariate time series is coded into a sequence of symbols
by assigning to each window the symbol in the codebook corresponding to the
closest cluster centroid. This symbolization process is sketched in Figures 1a
and 1b.</p>
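      <p>As an illustrative sketch (not the authors' code), the window-and-cluster symbolization described above can be written as follows; the window length, codebook size, and the use of scikit-learn's k-means are assumptions for the example:</p>

```python
import numpy as np
from sklearn.cluster import KMeans

def symbolize(series, win_len, codebook_size, seed=0):
    """Turn a (T, 13) MFCC time series into a sequence of symbols.

    Each non-overlapping window of win_len frames is flattened into a
    vector of size 13 * win_len (the 13 sub-series concatenated in MFCC
    order); k-means over these vectors builds the codebook, and each
    window is labeled by the index of its nearest centroid.
    """
    n_windows = series.shape[0] // win_len
    windows = series[:n_windows * win_len].reshape(n_windows, win_len, 13)
    # Concatenate the 13 sub-series of each window, dimension by dimension.
    vectors = windows.transpose(0, 2, 1).reshape(n_windows, 13 * win_len)
    km = KMeans(n_clusters=codebook_size, n_init=10, random_state=seed)
    return km.fit_predict(vectors)  # one symbol (cluster index) per window

# Toy example: 100 frames of synthetic 13-dimensional "MFCC" data.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))
symbols = symbolize(mfcc, win_len=5, codebook_size=6)  # 20 symbols in {0..5}
```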
      <p>The representation above may be too imprecise, as it mixes coefficients of
very different orders of magnitude. To cope with this problem, we propose to divide the 13
dimensions into 2 or more sub-bands of consecutive dimensions that represent
more closely related dimensions. The same transformation described above
operates on each sub-band and yields one codebook per sub-band. There are thus as
many symbolic sequences as there are sub-bands. Finally, the sub-band symbols
related to the same window are grouped into itemsets, as illustrated in Figure 1c.</p>
      <p>[Figure 1: (a) K-means clustering is performed on the set of windows to build a codebook (of size 6, here). (b) Every window is labeled by the symbol associated with the closest cluster centroid. (c) Conversion of a 2 sub-band time series into a sequence of itemsets using 2 codebooks of size 5.]</p>
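      <p>Grouping the per-sub-band symbols into itemsets is then a simple alignment of the sub-band sequences by window index; the two symbol sequences below are hypothetical:</p>

```python
# Hypothetical symbol sequences from two sub-band codebooks
# (one symbol per window, same windows in both sequences).
band1 = ['a', 'b', 'a', 'c']
band2 = ['x', 'y', 'x', 'x']

# One itemset per window: the symbols of all sub-bands at that window.
itemsets = [frozenset(pair) for pair in zip(band1, band2)]
```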
      <p>
        Once the raw signal is transformed into a symbolic sequence of items or
itemsets, classical sequential motif discovery algorithms can be applied. Two
kinds of sequential pattern discovery algorithms have been proposed: algorithms
that process sequences of items and algorithms that process sequences of itemsets
(an itemset is a set of items that occur in a short time period). We have chosen
to evaluate one algorithm of each kind in this paper: MaxMotif [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and
CMP-Miner [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], which process sequences of items and sequences of itemsets, respectively.
      </p>
      <p>Note that, in the classical setting of sequential pattern mining, a pattern
occurrence may skip symbols in the sequence. For instance, acccb is an occurrence
of pattern ab in sequence dacccbe. Generally, algorithms provide means to put
constraints on extracted motifs, such as minimum and maximum motif length
and the allowed gaps; gaps are symbols that can be skipped when looking for a
pattern occurrence. In our application, it is crucial to allow gaps in motifs, since
temporal distortions often occur in audio signals.</p>
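      <p>As a simplified sketch of gap-constrained matching (not the miners' actual code), the following function finds the start positions where an item pattern occurs in a sequence with at most max_gap skipped symbols between consecutive pattern items:</p>

```python
def occurs_with_gaps(pattern, sequence, max_gap):
    """Return the 1-based start positions where pattern occurs in
    sequence, allowing up to max_gap skipped symbols between
    consecutive pattern items (greedy leftmost matching)."""
    starts = []
    for start in range(len(sequence)):
        if sequence[start] != pattern[0]:
            continue
        pos, ok = start, True
        for item in pattern[1:]:
            # Look for the next pattern item within the allowed gap window.
            window = sequence[pos + 1 : pos + 2 + max_gap]
            if item in window:
                pos = pos + 1 + window.index(item)
            else:
                ok = False
                break
        if ok:
            starts.append(start + 1)  # 1-based positions, as in the text
    return starts

# 'acccb' is an occurrence of pattern 'ab' in 'dacccbe' once 3 gaps are allowed:
print(occurs_with_gaps('ab', 'dacccbe', max_gap=3))  # [2]
```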
      <p>MaxMotif enumerates all frequent (with respect to a given minimal support)
closed patterns in a database of item sequences. MaxMotif allows gaps in the
temporal domain, represented by a wildcard symbol. For instance, pattern
(f a), with one gap allowed between f and a, occurs in sequence (efcaefbaab) at positions 2 and 6.</p>
      <p>
        CMP-Miner extracts all frequent closed patterns in a database of itemset
sequences. It uses the PrefixSpan projection principle [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and the BIDE
bidirectional checking [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. CMP-Miner allows gaps both in the temporal domain
and inside an itemset. For instance, pattern ((b,f) (c,g,j)) occurs in sequence
((e,i) (b,f) (a,f) (c,g) (e,j) (b,f) (d,h) (c,g) (c,j) (a,h)) at positions 2 and 6,
gaps inside an itemset allowing the pattern itemset (c,g,j) to match the sequence
itemset (c,g).
      </p>
      <p>The parameters of the two methods are described in Table 1.</p>
      <p>We present in this section some results from two experiments, one on a synthetic
dataset and the other on a real dataset.
In this first experiment, we have created a dataset composed of 30 audio
signals corresponding to 10 utterances of each of the 3 words affaires, mondiale and
cinquante, pronounced by several French speakers. Our goal is to evaluate the
impact of the codebook size on the extracted motifs. The two algorithms
presented above have been applied to this dataset with the following parameters:
= 5, minSupport = 4, maxGap = 1, minLength = 4, maxLength = 20. For
CMP-Miner we set = 3 and minItem = 2. These parameter settings were
chosen after extensive tests over possible value ranges.</p>
      <p>First, sequential patterns are extracted. Then, we associate with each pattern
the word in whose utterances the pattern most often occurs. For each
extracted pattern, a precision/recall score is computed. Figures 2a and 2b depict
the precision/recall score versus the codebook size for MaxMotif and
CMP-Miner. As can be seen, MaxMotif obtains the best efficiency. This figure also
shows that when the codebook size increases, the precision improves slightly but
not the recall.</p>
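      <p>The per-pattern scoring can be sketched as follows; the function and data layout are illustrative assumptions (precision: fraction of the pattern's occurrences falling in utterances of its majority word; recall: fraction of that word's utterances covered):</p>

```python
from collections import Counter

def score_pattern(occ_words, utterances_per_word):
    """Precision/recall of one extracted pattern.

    occ_words: word label of each utterance the pattern occurs in
    (at most one entry per utterance).
    utterances_per_word: total utterance count for each word.
    """
    counts = Counter(occ_words)
    word, hits = counts.most_common(1)[0]  # associate the majority word
    precision = hits / len(occ_words)
    recall = hits / utterances_per_word[word]
    return word, precision, recall

# Hypothetical pattern occurring in 5 utterances, 4 of them 'cinquante'
# (10 utterances per word, as in the synthetic dataset).
word, p, r = score_pattern(
    ['cinquante'] * 4 + ['mondiale'],
    {'affaires': 10, 'mondiale': 10, 'cinquante': 10},
)
# word == 'cinquante', p == 0.8, r == 0.4
```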
      <p>Figure 2c shows the pattern length distribution for different codebook sizes
for MaxMotif. For small codebooks, many long patterns are extracted.
However, they are not very accurate because, being general, they can occur in many
different sequences. For big codebooks, many pattern candidates can be found,
reflecting sequence variability. However, many candidates have a low support,
often under the minimal threshold, and so fewer patterns are extracted.</p>
      <p>
        The symbolization step is crucial. Figure 2d shows five symbolic
representations of the word cinquante for a codebook of size 15. These strings highlight
the two kinds of variability (spectral and temporal) that make the task hard for
mining algorithms in this example. The same experiment was performed using
the SAX symbolization method [
the SAX symbolization method [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] on each dimension of the multidimensional
time series. This representation turned out to be less accurate. Indeed, the results
obtained by CMP-Miner using the SAX representation were worse. There is no
space to detail these results here.
      </p>
      <p>Experiment on a larger database.
Now, we consider a dataset containing 7 hours of audio content. The dataset is
divided into 21 audio tracks coming from various radio stations. This experiment
is closer to a real setting.</p>
      <p>Only MaxMotif has been tested on this dataset. The parameters were: = 4,
= 80, minSupport = 40, maxGap = 1, minLength = 5, maxLength = 20.
The codebook size is greater than in the previous experiment, to deal with more
different sounds. Pattern extraction is very fast: less than 4 minutes for more
than one million patterns. Some of them are interesting and correspond, for
instance, to crowd noises, jingles and music patterns, or short silences. However,
similarly to the experiment on the synthetic dataset, only very few patterns
corresponding to repeated words could be extracted.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we have presented preliminary work investigating how to use
sequential pattern mining algorithms on audio data. The aim of this work was
to evaluate whether these algorithms could be relevant for this problem. The
experiments pointed out the difficulty of mining audio signals, because of temporal
and spectral distortion. The same words pronounced in different contexts and by
different speakers can sound very different and yield very different patterns. The
results are promising, but both the symbolization and the motif extraction should be
improved. For instance, to account for spectral variability, considering distances
between symbols should improve the overall performance of pattern extraction.</p>
      <p>[Figure 2: (a) Precision/Recall curves for MaxMotif. (b) Precision/Recall curves for CMP-Miner with 3 sub-bands. (c) Pattern size distribution for different codebook sizes. (d) Example of representation for a codebook of size 15.]</p>
      <p>We have also noticed that the dimensions of the MFCC time series are
not all equally important for the discovery. Selecting or weighting the dimensions of the
multidimensional time series could also improve the performance.</p>
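      <p>A symbol-to-symbol distance of the kind suggested above could, for instance, reuse the codebook centroids; this is a sketch of our own, not something evaluated in the paper:</p>

```python
import numpy as np

def symbol_distance(centroids, s1, s2):
    """Distance between two symbols, taken as the Euclidean distance
    between their k-means codebook centroids."""
    return float(np.linalg.norm(centroids[s1] - centroids[s2]))

# Toy codebook of 3 centroids in a 2-dimensional feature space.
centroids = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])
d01 = symbol_distance(centroids, 0, 1)  # 5.0 (the 3-4-5 triangle)
```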
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueen</surname>
          </string-name>
          ,
          <article-title>Enumeration of time series motifs of all lengths</article-title>
          ,
          <source>in Data Mining (ICDM), 2013 IEEE 13th International Conference on</source>
          , pp.
          <fpage>547</fpage>
          <lpage>556</lpage>
          ,
          <year>Dec. 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Keogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Lonardi</surname>
          </string-name>
          ,
          <article-title>Experiencing SAX: A novel symbolic representation of time series</article-title>
          ,
          <source>Data Min. Knowl. Discov.</source>
          , vol.
          <volume>15</volume>
          , pp.
          <fpage>107</fpage>
          <lpage>144</lpage>
          , Oct.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Herley</surname>
          </string-name>
          ,
          <article-title>ARGOS: Automatically extracting repeating objects from multimedia streams</article-title>
          ,
          <source>IEEE Trans. on Multimedia</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>115</fpage>
          <lpage>129</lpage>
          , Feb.
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P.</given-names>
            <surname>Esling</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Agon</surname>
          </string-name>
          ,
          <article-title>Time-series data mining</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          , vol.
          <volume>45</volume>
          , no.
          <issue>1</issue>
          , p.
          <fpage>12</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C. H.</given-names>
            <surname>Mooney</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Roddick</surname>
          </string-name>
          ,
          <article-title>Sequential pattern mining approaches and algorithms</article-title>
          ,
          <source>ACM Comput. Surv.</source>
          , vol.
          <volume>45</volume>
          ,
          Mar.
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Park</surname>
          </string-name>
          and
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <article-title>Unsupervised pattern discovery in speech</article-title>
          ,
          <source>IEEE Transactions on Audio, Speech, and Language Processing</source>
          , vol.
          <volume>16</volume>
          , pp.
          <fpage>186</fpage>
          <lpage>197</lpage>
          , Jan.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Muscariello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Bimbot</surname>
          </string-name>
          ,
          <article-title>Audio keyword extraction by unsupervised word discovery</article-title>
          ,
          <source>in INTERSPEECH 2009: 10th Annual Conference of the International Speech Communication Association</source>
          , (Brighton, United Kingdom),
          Sept.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Burred</surname>
          </string-name>
          ,
          <article-title>Genetic motif discovery applied to audio analysis</article-title>
          ,
          <source>in International Conference on Acoustics, Speech and Signal Processing</source>
          , pp.
          <fpage>361</fpage>
          <lpage>364</lpage>
          , IEEE,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>T.</given-names>
            <surname>Bailey</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Elkan</surname>
          </string-name>
          ,
          <article-title>Unsupervised learning of multiple motifs in biopolymers using expectation maximization</article-title>
          ,
          <source>Machine Learning</source>
          , vol.
          <volume>21</volume>
          , no.
          <issue>1-2</issue>
          , pp.
          <fpage>51</fpage>
          <lpage>80</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>L. S. d.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. K.</given-names>
            <surname>do Patrocinio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J. F.</given-names>
            <surname>Guimarães</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gravier</surname>
          </string-name>
          ,
          <article-title>Searching for near-duplicate video sequences from a scalable sequence aligner</article-title>
          ,
          <source>in International Symposium on Multimedia</source>
          , pp.
          <fpage>223</fpage>
          <lpage>226</lpage>
          , IEEE,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>M.</given-names>
            <surname>Zaharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Bolosky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Curtis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Karp</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sittler</surname>
          </string-name>
          ,
          <article-title>Faster and more accurate sequence alignment with SNAP</article-title>
          ,
          <source>arXiv preprint arXiv:1111.5572</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Megalooikonomou</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Faloutsos</surname>
          </string-name>
          ,
          <article-title>Time series analysis with multiple resolutions</article-title>
          ,
          <source>Information Systems</source>
          , vol.
          <volume>35</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>56</fpage>
          <lpage>74</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>H.</given-names>
            <surname>Arimura</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Uno</surname>
          </string-name>
          ,
          <article-title>An efficient polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence</article-title>
          ,
          <source>Journal of Combinatorial Optimization</source>
          , vol.
          <volume>13</volume>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>A. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.-T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Mining closed patterns in multi-sequence time-series databases</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          , vol.
          <volume>68</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1071</fpage>
          <lpage>1090</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>J.</given-names>
            <surname>Pei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mortazavi-Asl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pinto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Dayal</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.-C.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <article-title>PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth</article-title>
          ,
          <source>in International Conference on Data Engineering</source>
          , p.
          <fpage>215</fpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>BIDE: Efficient mining of frequent closed sequences</article-title>
          ,
          <source>in Data Engineering, Proceedings. 20th International Conference on Data Engineering</source>
          , pp.
          <fpage>79</fpage>
          <lpage>90</lpage>
          ,
          <year>March 2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>