=Paper=
{{Paper
|id=Vol-2571/CSP2019_paper_20
|storemode=property
|title= Application of Meta-Learning Methods in Recognition of Drums on the Basis of Short Soundsamples (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2571/CSP2019_paper_20.pdf
|volume=Vol-2571
|authors=Tomasz Krzywicki
|dblpUrl=https://dblp.org/rec/conf/csp/Krzywicki19
}}
== Application of Meta-Learning Methods in Recognition of Drums on the Basis of Short Soundsamples (short paper) ==
Application of meta-learning methods in the recognition of drums and cymbals on the basis of short sound samples

Tomasz Krzywicki
Faculty of Mathematics and Computer Science, University of Warmia and Mazury in Olsztyn, Poland
email: tomasz.krzywicki@student.uwm.edu.pl

Abstract. This article presents a proposal for applying a Siamese neural network to classify short music-instrument sound samples as percussion or non-percussion instruments. The learning process used 15 sound samples for each decision class. The accuracy of the solution was verified by a 5-fold cross-validation test, and the proposed solution achieved a satisfactory score.

1 Introduction

Classifying audio files on the basis of their sound content is difficult. Today's popular classification methods, such as deep neural networks, require large numbers of learning examples to achieve satisfactory scores. Meta-learning [8] and transfer-learning [9] methods may be useful for small sets of learning examples. The motivation for the proposed method was an attempt to use meta-learning to classify short music-instrument samples as either percussion instruments belonging to a basic drum kit or other instruments. To simplify the creation of the learning dataset, the solution should work correctly with a small number of samples. The solution is based on the Siamese neural network architecture, which classifies a sound as a percussion or non-percussion instrument on the basis of two parallel inputs, each a sample of a music-instrument sound. The proposed solution achieved a satisfactory score.

Sections 2 and 3 familiarize the Reader with the basic concepts of the meta-learning approach and with the most common sound-processing method, MFCC. Section 4 describes the architecture of the Siamese neural network used in the experiment.
Section 5 details the preparations for the experiment, i.e. the way the dataset was created. Section 6 familiarizes the Reader with the details of the classification model used in the experiment. Section 7 describes the accuracy test of the model and the scores it obtained, and Section 8 summarizes the experiment and outlines planned future work.

Copyright (c) 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Meta-learning

Learning and meta-learning methods are used to extract knowledge from data. Let the learning process of a learning machine L be defined by a function A(L) [4]:

A(L) : K_L × D → M    (1)

where:
– K_L denotes the space of configuration parameters of the given learning machine L
– D denotes the space of data streams (typically a decision system [1])
– M denotes the space of goal models

Meta-learning is a specific kind of learning: in meta-learning, the learning phase learns how to learn, so as to learn as well as possible. In other words, the target model of meta-learning (its output) is a configuration of a learning model extracted by the meta-learning algorithm. The configuration produced by the meta-learning method should play the goal role (such as classifier or regressor) of the meta-learning task [4].

Meta-learning can be classified in a few ways, from finding optimal sets of weights to learning the optimizer. Currently, the term meta-learning covers the following categories [8]:
– learning the metric space
– learning the initializations
– learning the optimizer

2.1 Learning the metric space

Metric-based meta-learning is based on learning an appropriate metric space. For example, such a process can be used to learn the similarity between two sentences.
This approach is widely used in few-shot learning, where the learning dataset contains only a small number of samples in each decision class [8]. The method of learning the metric space was also used in the proposed solution.

2.2 Learning the initializations

Learning the initializations is based on trying to learn optimal initial parameter values. The classical learning approach (for example, for a neural network) initializes random parameters, calculates the loss, and minimizes the loss through gradient descent in order to find optimal parameters. The meta-learning approach instead starts from initial parameter values that are already close to optimal, so that the model can be learned very quickly [8].

2.3 Learning the optimizer

This method is based on learning the optimizer itself. In few-shot learning, gradient descent fails when the training set has too few objects, so the optimizer should itself be learned. In other words, there are two networks: a base network that actually tries to learn, and a meta network that optimizes the base network [8].

3 Sound processing

The first step in any automatic sound recognition is to extract features, i.e. to identify the components of the audio signal that are good for identifying the linguistic content, while discarding everything else that carries information such as background noise [7].

Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in automatic speech and speaker recognition. They were introduced by Davis and Mermelstein in the 1980s and have been state of the art ever since. Prior to the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear Prediction Cepstral Coefficients (LPCCs) were the main feature types for automatic speech recognition (ASR), especially with HMM (Hidden Markov Model) classifiers.
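As a concrete illustration, the standard MFCC pipeline can be written as a simplified numpy sketch. This is not the code used in the paper: the frame size, hop length, filter count, and the synthetic test tone are illustrative assumptions, and pre-emphasis, windowing, and liftering are omitted for brevity.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def dct2(x):
    """DCT-II along the last axis (unnormalized, enough for a sketch)."""
    N = x.shape[-1]
    n = np.arange(N)
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * n[:, None] / (2 * N))
    return x @ basis.T

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_filters=26, n_coeffs=12):
    # 1. Frame the signal into short frames.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] for i in range(n_frames)])
    # 2. Periodogram estimate of the power spectrum for each frame.
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / frame_len
    # 3. Apply the mel filterbank and sum the energy in each filter.
    energies = spectrum @ mel_filterbank(n_filters, frame_len, sr).T
    # 4. Take the logarithm of all filterbank energies.
    log_e = np.log(energies + 1e-10)
    # 5. DCT of the log energies; 6. keep coefficients 2-13, discard the rest.
    return dct2(log_e)[:, 1:1 + n_coeffs]

# Toy usage: a 0.1 s synthetic 440 Hz tone instead of a real instrument sample.
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
features = mfcc(np.sin(2 * np.pi * 440.0 * t), sr=sr)
print(features.shape)  # (8, 12): 8 frames, 12 cepstral coefficients each
```

In practice a library such as librosa would be used for this step; the sketch only mirrors the textbook procedure.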
The procedure for converting the sound spectrum into numerical vectors by the MFCC method is as follows [7]:
– frame the signal into short frames
– for each frame, calculate the periodogram estimate [5] of the power spectrum
– apply the mel filterbank to the power spectra and sum the energy in each filter
– take the logarithm of all filterbank energies
– take the DCT of the log filterbank energies
– keep DCT coefficients 2-13 and discard the rest

4 Architecture of Siamese networks

A Siamese network is a special type of neural network most popularly used in one-shot learning algorithms, so it is predominantly applied where only a small number of learning objects is available. Siamese networks basically consist of two symmetrical neural networks sharing the same weights and architecture, joined together at the end by some energy function E. The objective of a Siamese network is to learn a metric space expressing the similarity of two objects, for example two sound samples [8].

In Figure 1 you can see that the input of the Siamese network receives two samples (sample_a, sample_b) in the form of tensors. The samples are then processed by each of the twin networks, and their outputs are forwarded to an energy function which calculates the similarity (metric distance) of the two samples.

Fig. 1. Siamese neural network evaluating similarity of two samples

Siamese neural networks are used not only for sound recognition. They are also used for face recognition, signature verification, object tracking, similar-question retrieval, and more [8].

4.1 Detailed architecture of Siamese neural networks

The detailed architecture of a Siamese neural network is shown in Figure 2.

Fig. 2. Detailed architecture of a Siamese network

There are two inputs, sample_a and sample_b, which are forwarded to networks networkA and networkB, respectively. The network outputs are defined by the formula f_w(sample_x), where sample_x denotes the input of the respective neural network.
The outputs of the networks are then forwarded to the energy function E, represented by the formula [8]:

E_w(sample_a, sample_b) = ||f_w(sample_a) − f_w(sample_b)||    (2)

5 Dataset preparation

Six melodic instruments and six percussion instruments were used in the experiment. The group of melodic instruments consists of brass instruments, sound synthesizers, flute, guitar, organs, and piano. The group of percussion instruments consists of the basic version of a drum set: crash cymbal, hi-hat cymbal, kick drum, ride cymbal, snare drum, and tom drums. Each group of instruments contains both acoustic and electronic sound samples. There are 15 sound samples for each instrument.

The aim of this paper is to use a Siamese neural network to evaluate the similarity of sounds between groups of instruments for the detection of percussion instruments. Therefore the data collection will be further processed into a decision system with the attributes (sample_a, sample_b, similarity), where:
– sample_a denotes the tensor of the first sound sample after processing by the MFCC method
– sample_b denotes the tensor of the second sound sample after processing by the MFCC method
– similarity denotes the decision attribute, which classifies the similarity of the two sound samples

A detailed overview of the sample collection for further processing is presented in Table 1:

Instrument          | Group of instruments | Number of samples
brass instruments   | melodic              | 15
crash cymbal        | percussion           | 15
sound synthesizer   | melodic              | 15
flute               | melodic              | 15
guitar              | melodic              | 15
hi-hat cymbal       | percussion           | 15
kick drum           | percussion           | 15
organs              | melodic              | 15
piano               | melodic              | 15
ride cymbal         | percussion           | 15
snare drum          | percussion           | 15
tom drums           | percussion           | 15

Table 1. Detailed overview of the instrument sample collection

Two percussion instruments, for example a ride cymbal and a snare drum, may be considered similar sound samples. A percussion instrument paired with a non-percussion instrument, for example a kick drum with a piano, may be considered dissimilar sound samples.
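One way to build such a decision system of labelled pairs can be sketched as follows. The sketch mirrors the selection procedure used in the paper (alternating similar and dissimilar pairs over the percussion instruments), but `draw_sample` is a hypothetical placeholder for loading an MFCC tensor, and the random drawing is an illustrative assumption.

```python
import random

MELODIC = ["brass instruments", "sound synthesizer", "flute",
           "guitar", "organs", "piano"]
PERCUSSION = ["crash cymbal", "hi-hat cymbal", "kick drum",
              "ride cymbal", "snare drum", "tom drums"]

def draw_sample(instrument):
    """Hypothetical placeholder for loading one (20, 400) MFCC tensor."""
    return f"<mfcc tensor of {instrument}>"

def build_decision_system(seed=0):
    """Rows of (sample_a, sample_b, similarity):
    percussion vs. melodic    -> label 0 (dissimilar),
    percussion vs. percussion -> label 1 (similar)."""
    rng = random.Random(seed)
    rows = []
    for i, perc in enumerate(PERCUSSION):
        if i % 2 == 0:
            # even iteration: draw a melodic instrument -> dissimilar pair
            other, label = rng.choice(MELODIC), 0
        else:
            # odd iteration: draw another percussion instrument -> similar pair
            other = rng.choice([p for p in PERCUSSION if p != perc])
            label = 1
        rows.append((draw_sample(perc), draw_sample(other), label))
    return rows

rows = build_decision_system()
print(len(rows), [label for _, _, label in rows])  # 6 pairs, alternating 0/1
```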
The procedure for selecting similar and dissimilar sound samples is as follows:

1. For each instrument in the percussion-instrument group:
(a) If the iteration is even, draw another melodic instrument and one of its samples. Add both sample tensors to the decision system with label 0, which means dissimilar sounds.
(b) If the iteration is odd, draw another percussion instrument and one of its samples. Add both sample tensors to the decision system with label 1, which means similar sounds.

An exemplary decision system based on this procedure is presented in Table 2. In order to maintain compatibility between the sound samples, each sample tensor has been reduced to shape (20, 400). Depending on the sampling of a sound sample, this may introduce a slight difference in the processed sound; however, it does not affect the quality of classification.

sample_a                                   | sample_b                                       | similarity
[[-299.0982, ..., -54.3453]] (kick drum)   | [[312.765, ..., 43.8856]] (ride cymbal)        | 1
[[19.0841, ..., 88.5388]] (hi-hat cymbal)  | [[99.0098, ..., 64.9856]] (piano)              | 0
[[24.0991, ..., 75.5542]] (crash cymbal)   | [[246.0558, ..., 98.5436]] (snare drum)        | 1
[[-132.0841, ..., 45.6430]] (tom drum)     | [[199.7355, ..., 99.1432]] (brass instruments) | 0

Table 2. Exemplary decision system of similarity and dissimilarity of music-instrument sound samples

6 Classification model

The Siamese neural network model was used to classify the similarity of two music-instrument sounds.
A single neural network (cloned for the construction of the Siamese network) was constructed as follows:
– input: a (20, 400)-shaped tensor containing the vectorized spectrum of a music-instrument sound
– hidden layers:
  • Flatten layer
  • Dense layer of size 128 with ReLU activation function
  • Dropout layer with factor 0.1
  • Dense layer of size 128 with ReLU activation function
  • Dropout layer with factor 0.1
  • Dense layer of size 128 with ReLU activation function
  • Dropout layer with factor 0.1
  • Dense layer of size 64 with ReLU activation function
  • Dropout layer with factor 0.1
– output: Dense layer of size 64 with ReLU activation function

The Euclidean distance, defined as follows, was used as the energy function in the Siamese network [6]:

d(x, y) = sqrt( Σ_{i=1}^{n} (a_i(x) − a_i(y))^2 )    (3)

where:
– d(x, y) denotes the Euclidean distance in the n-dimensional space of real numbers
– a_i(x) denotes the value of the i-th coordinate of point x in the n-dimensional space of real numbers
– a_i(y) denotes the value of the i-th coordinate of point y in the n-dimensional space of real numbers

The full diagram of the Siamese neural network used in the experiment is presented in Figure 3.

Fig. 3. Full diagram of the Siamese network used in the experiment

6.1 Contrastive loss

Contrastive loss was used as the loss function during model training. The contrastive loss function is based on learning the parameters of a parametrized function in such a way that neighbors are pulled together and non-neighbors are pushed apart. Prior knowledge can be used to identify the neighbors for each training data point [3].
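The Euclidean energy function and the contrastive loss can be written directly in numpy. This is a minimal sketch of the formulas, not the training code from the experiment; the margin value and the toy 64-dimensional embeddings (matching the output size of each twin network) are illustrative assumptions.

```python
import numpy as np

def energy(fa, fb):
    """Euclidean distance between the two twin-network embeddings."""
    return float(np.sqrt(np.sum((fa - fb) ** 2)))

def contrastive_loss(y, e, margin=1.0):
    """Contrastive loss: similar pairs (y=1) are pulled together,
    dissimilar pairs (y=0) are pushed at least `margin` apart."""
    return y * e ** 2 + (1 - y) * max(margin - e, 0.0) ** 2

# Two hypothetical 64-dimensional embeddings at distance 0.5.
fa = np.zeros(64)
fb = np.zeros(64)
fb[0] = 0.5

e = energy(fa, fb)             # 0.5
print(contrastive_loss(1, e))  # similar pair: distance itself is penalized
print(contrastive_loss(0, e))  # dissimilar pair: penalized for being inside the margin
print(contrastive_loss(0, 2.0))  # dissimilar pair beyond the margin: zero loss
```

Note how the margin makes the loss one-sided for dissimilar pairs: once their distance exceeds the margin, pushing them further apart brings no additional reward.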
The contrastive loss function is defined by the following formula [8]:

L = Y E^2 + (1 − Y) max(margin − E, 0)^2    (4)

where:
– L denotes the contrastive loss function
– Y denotes the expected model prediction (the label)
– E denotes the energy function
– margin denotes a parameter of the loss function: the threshold for classifying the distance calculated by the energy function as similarity

7 Model accuracy test

The Siamese neural network model was trained over 21 training epochs, with 75% of the data in the training subset and 25% in the validation subset. In order to verify the accuracy of the classification on a small dataset, a 5-fold cross-validation test was performed.

7.1 Cross-validation

k-fold cross-validation is based on dividing the dataset into k separate subsets and then repeatedly training the model on k−1 subsets and checking the accuracy on the one remaining test subset [2]. The test subsets have to be unique. The average accuracy over all k tests and its standard deviation are the result of the test [1].

7.2 The accuracy obtained by the model

After applying the 5-fold cross-validation test, the model obtained the results shown in Table 3:

Subset   | Accuracy | Standard deviation
training | 0.902655 | 0.044054
test     | 0.85054  | 0.084436

Table 3. Scores obtained by the model

The accuracy obtained by the model may indicate overfitting, which in this case (a small number of samples in the dataset) may be acceptable.

8 Conclusions

The key aim of this article was to present a proposal for recognizing short sound samples as percussion or melodic instruments based on a Siamese neural network. At the beginning of the article the Reader was familiarized with the basic concepts of the meta-learning approach and of sound processing. Then the architecture of Siamese neural networks, later used in the experiment, was presented.
Next, the preparations for the experiment were described: the way the dataset was created and the classification model. The suggested solution obtained a satisfactory effectiveness, confirmed by the 5-fold cross-validation test: 85% accuracy.

The proposed solution is the start of work on a method for recognizing percussion instruments in full sound tracks. The method will aim at creating musical notation for the percussion instruments in any sound recording (if they are present). In the future, it is planned to create sound-classification models on the basis of other meta-learning methods and to compare their effectiveness with one another. Based on the effectiveness of these models, further work will be carried out towards the planned objective.

References

1. Artiemjew, P.: Wybrane paradygmaty sztucznej inteligencji. PJATK Publishing House (2013)
2. Chollet, F.: Deep Learning. Praca z językiem Python i biblioteką Keras. Helion Publishing House (2019)
3. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality Reduction by Learning an Invariant Mapping, http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
4. Jankowski, N., Duch, W., Grąbczewski, K.: Meta-Learning in Computational Intelligence. Springer Verlag (2011)
5. Knopov, P., Bila, G.: Periodogram estimates in nonlinear regression models with long-range dependent noise. Cybernetics and Systems Analysis, Vol. 49(4), pp. 624-631 (2013)
6. Krzywicki, T.: Weather and a Part of Day Recognition in the Photos Using a kNN Methodology. Technical Sciences, 21(4), pp. 291-302 (2018)
7. Mel Frequency Cepstral Coefficient (MFCC) tutorial: http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
8. Ravichandiran, S.: Hands-On Meta Learning with Python. Packt Publishing (2018)
9. Sarkar, D., Bali, R., et al.: Hands-On Transfer Learning with Python. Packt Publishing (2018)