<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Application of meta-learning methods in the recognition of drums and cymbals on the basis of short sound samples</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomasz Krzywicki</string-name>
          <email>tomasz.krzywicki@student.uwm.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Computer Science University of Warmia and Mazury in Olsztyn Poland</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This article presents a proposal for applying a Siamese neural network to classify short music-instrument sound samples as percussion or non-percussion instruments. In the learning process, 15 sound samples representing each decision class were used. The accuracy of the solution was verified by a 5-fold cross-validation test. The proposed solution achieved a satisfactory score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Classification of sound files on the basis of sound is difficult. Today's popular
classification methods, such as deep neural networks, require large
numbers of learning examples to achieve satisfactory scores. Meta-learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and transfer-learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] methods may be useful for small sets of learning
examples.
      </p>
      <p>The motivation for the proposed method was an attempt to use
meta-learning methods to classify short music-instrument
samples as either percussion instruments belonging to a basic drum kit or other
instruments. To simplify the creation of the learning dataset, the
solution should work correctly with a small number of samples. The solution is
based on the Siamese neural network architecture, which classifies a sound as
a percussion or non-percussion instrument on the basis of two parallel inputs
containing samples of music-instrument sounds. The proposed solution achieved a
satisfactory score.</p>
      <p>Sections 2 and 3 familiarize the Reader with the basic concepts of the
meta-learning approach and the most common sound processing method, MFCC.
Section 4 describes the architecture of the Siamese neural network that
was used in the experiment. Section 5 details the preparations for
the experiment, i.e. the way the dataset was created. Section 6 familiarizes the Reader
with the details of the classification model used in the experiment.
Section 7 presents the test of the model and the accuracy the model
obtained, and Section 8 summarizes the
experiment carried out and provides information on planned future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Meta learning</title>
      <p>
        Learning and meta-learning methods are used to extract knowledge from
data. Let the learning process of a learning machine L be defined by a function
A(L): [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>A(L) : K_L × D → M (1)</p>
      <p>
        where:
- K_L denotes the space of configuration parameters of the given learning machine L
- D denotes the space of data streams (typically a decision system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ])
- M denotes the space of goal models
      </p>
      <p>
        Meta-learning is a specific kind of learning method. In the case of
meta-learning, the learning phase learns how to learn, i.e. how to learn as well as possible.
In other words, the target model of meta-learning (the output of meta-learning)
is a configuration of a learning model extracted by the meta-learning algorithm. The
configuration produced by the meta-learning method should play the goal role (e.g.
classifier or regressor) of the meta-learning task. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>
        Meta-learning can be classified in a few ways, ranging from finding optimal
sets of weights to learning the optimizer. Currently, the term meta-learning
covers the following categories: [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
- Learning the metric space
- Learning the initializations
- Learning the optimizer
      </p>
      <sec id="sec-2-1">
        <title>Learning the metric space</title>
        <p>
          Metric-based meta-learning is based on learning an appropriate metric
space. For example, the process can be used to learn the similarity between two sentences.
This approach is widely used in few-shot learning, where the learning
dataset has a small number of samples in each decision class. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] The method
of learning the metric space was also used in the proposed solution.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Learning the initializations</title>
        <p>
          Learning the initializations is based on trying to learn optimal initial
parameter values. The classical learning approach (for example, for a neural network) is
based on initializing random parameters, calculating the loss and minimizing the loss
through gradient descent in order to find optimal parameters. The meta-learning
approach is based on starting from parameter values close to the optimal
ones in order to learn the model very fast. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Learning the optimizer</title>
        <p>
          This method is based on learning the optimizer itself. In the case of few-shot learning,
gradient descent fails when the training set has too few objects, so
the optimizer should itself be learned. In other words, there are two networks: a base
network that actually tries to learn and a meta network that optimizes the base
network. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Sound processing</title>
      <p>
        The first step in any automatic sound recognition is to extract features, i.e.
to identify the components of the audio signal that are good for identifying
the linguistic content while discarding everything else that carries information
such as background noise. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
      </p>
      <p>
        Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in
automatic speech and speaker recognition. They were introduced by Davis and
Mermelstein in the 1980s and have been state-of-the-art ever since. Prior to
the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear
Prediction Cepstral Coefficients (LPCCs) were the main feature types for
automatic speech recognition (ASR), especially with HMM (Hidden Markov
Model) classifiers. The procedure for converting a sound spectrum into numerical
vectors by the MFCC method is as follows: [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
- Frame the signal into short frames
- For each frame, calculate the periodogram estimate [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of the power spectrum
- Apply the mel filterbank to the power spectra and sum the energy in each filter
- Take the logarithm of all filterbank energies
- Take the DCT of the log filterbank energies
- Keep DCT coefficients 2-13, discard the rest
      </p>
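      <p>The steps above can be sketched in plain NumPy. This is an illustrative toy, not the implementation used in the paper; the frame length, hop size, FFT size and number of mel filters are assumed values:</p>

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_keep=12):
    """Toy MFCC pipeline following the steps above (illustrative only)."""
    # 1. Frame the signal into short frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # 2. Periodogram estimate of the power spectrum for each windowed frame
    nfft = 512
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), nfft))**2 / nfft
    # 3. Mel filterbank: triangular filters spaced evenly on the mel scale
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10**(m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for j in range(n_mels):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 4. Logarithm of the filterbank energies
    log_e = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II of the log energies; 6. keep coefficients 2-13, discard the rest
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mels), 2 * k + 1) / (2 * n_mels))
    return (log_e @ dct.T)[:, 1:1 + n_keep]

coeffs = mfcc(np.random.randn(16000))
print(coeffs.shape)  # one row of 12 coefficients per frame
```

      <p>Library implementations (e.g. librosa) add pre-emphasis, liftering and other refinements omitted here.</p>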
    </sec>
    <sec id="sec-4">
      <title>Architecture of siamese networks</title>
      <p>
        A Siamese network is a special type of neural network most popularly used in
one-shot learning algorithms, so a Siamese network is predominantly used in
applications with a small number of learning objects. Siamese networks basically
consist of two symmetrical neural networks, both sharing the same weights and
architecture, joined together at the end by some energy function E. The
objective of a Siamese network is to learn a metric space of similarity between two objects,
for example two sound samples. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
      </p>
      <p>As Figure 1 shows, the input of the Siamese network receives two
samples (sample a, sample b) in the form of tensors. The samples are then processed
by each of the twin networks, and their outputs are forwarded to an energy function
which calculates the similarity (metric distance) of the two samples.</p>
      <p>
        Siamese neural networks are commonly used not only for sound recognition.
They are also used for face recognition, signature verification, object tracking,
similar-question retrieval and more. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
      </p>
      <sec id="sec-4-1">
        <title>Detailed architecture of siamese neural networks</title>
        <p>The detailed architecture of the Siamese neural network is shown in Figure 2.</p>
        <p>
          There are two inputs: sample a and sample b. Inputs sample a and sample b
are forwarded to networkA and networkB respectively. The network
outputs are defined by the formula f_w(sample_x), where sample_x denotes the input of the
appropriate neural network. The outputs of the networks are then forwarded to an energy
function E, represented by the formula: [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
        <p>E_w(sample a, sample b) = ||f_w(sample a) − f_w(sample b)|| (2)</p>
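        <p>The twin forward pass and energy function above can be sketched in a few lines of NumPy. The shared weight matrix and single-layer embedding are illustrative placeholders; the actual network used in the experiment is described in Section 6:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared ("twin") embedding f_w: both inputs pass through the SAME weights.
W = rng.standard_normal((20 * 400, 64)) * 0.01

def f_w(sample):
    # Flatten the (20, 400) MFCC tensor and project it into a 64-d embedding.
    return np.maximum(sample.ravel() @ W, 0.0)  # ReLU

def energy(sample_a, sample_b):
    # E_w = ||f_w(a) - f_w(b)||: distance in the learned metric space.
    return np.linalg.norm(f_w(sample_a) - f_w(sample_b))

a = rng.standard_normal((20, 400))
b = rng.standard_normal((20, 400))
print(energy(a, a))  # identical inputs: distance 0.0
print(energy(a, b))  # different inputs: positive distance
```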
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Dataset preparing</title>
      <p>6 melodic instruments and 6 percussion instruments were used in the experiment.
The group of melodic instruments consists of brass instruments, sound synthesizers,
flute, guitar, organs and piano. The group of percussion instruments consists of the basic
version of a drum set: crash cymbal, hi-hat cymbal, kick drum, ride cymbal,
snare drum and tom drums. Each group of instruments contains both acoustic and
electronic sound samples. For each instrument there are 15 sound samples.</p>
      <p>The aim of this paper is to use a Siamese neural network to evaluate the
similarity of sounds of groups of instruments for the detection of percussion instruments.
Therefore, the data collection is further processed into a
decision system with the following attributes: (sample a, sample b, similarity),
where:
- sample a denotes the tensor of the first sound sample after processing by the MFCC method
- sample b denotes the tensor of the second sound sample after processing by the MFCC method
- similarity denotes the decision attribute, which classifies the similarity of the two sound samples</p>
      <p>A detailed overview of the sample collection for further processing is
presented in Table 1.</p>
      <p>Table 1. Instruments used in the experiment: brass instruments, crash cymbal,
sound synthesizer, flute, guitar, hi-hat cymbal, kick drum, organs, piano,
ride cymbal, snare drum, tom drums</p>
      <p>Two percussion instruments, for example a ride cymbal and a snare drum, may be
considered similar sound samples. A percussion instrument paired with a non-percussion
instrument, for example a kick drum with a piano, may be considered dissimilar sound
samples. The procedure for the selection of similar and dissimilar sound
samples is as follows:
1. For each instrument of the percussion instrument group:
(a) If the iteration is even, draw another melodic instrument and one of its samples.
Add both tensors of samples to the decision system with label 0, which
means dissimilar sounds.
(b) If the iteration is odd, draw another percussion instrument and one of its samples.
Add both tensors of samples to the decision system with label 1, which
means similar sounds.</p>
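      <p>The pairing procedure above can be sketched as follows. This is an illustrative reconstruction; the random draws, seed and the pairing of labels with even/odd iterations follow the description above, not published code:</p>

```python
import random

percussion = ["crash cymbal", "hi-hat cymbal", "kick drum",
              "ride cymbal", "snare drum", "tom drums"]
melodic = ["brass instruments", "sound synthesizer", "flute",
           "guitar", "organs", "piano"]

def build_pairs(seed=0):
    """Even iterations pair a drum with a melodic instrument (label 0,
    dissimilar); odd iterations pair it with another drum (label 1, similar)."""
    rng = random.Random(seed)
    rows = []
    for i, drum in enumerate(percussion):
        if i % 2 == 0:
            other = rng.choice(melodic)
            rows.append((drum, other, 0))   # dissimilar sounds
        else:
            other = rng.choice([p for p in percussion if p != drum])
            rows.append((drum, other, 1))   # similar sounds
    return rows

for row in build_pairs():
    print(row)
```

      <p>In the real decision system each entry holds the MFCC tensors of the drawn samples rather than instrument names.</p>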
      <p>An exemplary decision system based on this procedure is presented in Table 2.</p>
      <p>In order to maintain compatibility between the sound samples of the music
instruments, each sound-sample tensor has been reduced to shape (20, 400).
Depending on the sampling of the sound sample, this may mean a slight difference in the
processed sound. However, it does not affect the quality of the classification.</p>
      <p>Table 2. Exemplary decision system of similarity and dissimilarity of samples of sounds
of music instruments:
sample a | sample b
[[-299.0982, ..., -54.3453]] (kick drum) | [[312.765, ..., 43.8856]] (ride cymbal)
[[19.0841, ..., 88.5388]] (hi-hat cymbal) | [[99.0098, ..., 64.9856]] (piano)
[[24.0991, ..., 75.5542]] (crash cymbal) | [[246.0558, ..., 98.5436]] (snare drum)
[[-132.0841, ..., 45.6430]] (tom drum) | [[199.7355, ..., 99.1432]] (brass instruments)</p>
    </sec>
    <sec id="sec-6">
      <title>Classi cation model</title>
      <p>The Siamese neural network model was used to classify the similarity of two
sounds of music instruments. A single neural network (cloned for the construction
of the Siamese network) was constructed as follows:
- input: a (20, 400) shaped tensor containing the vectorized spectrum of the sound of a
music instrument
- hidden layers:
  - Flatten layer
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 64 with ReLU activation function
  - Dropout layer with factor 0.1
- output: Dense layer of size 64 with ReLU activation function</p>
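      <p>The layer stack described above can be traced shape-by-shape with a small NumPy forward pass. The random weights and the inverted-dropout formulation are illustrative assumptions; only the layer sizes come from the description above:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def dense_relu(x, w, b):
    return np.maximum(x @ w + b, 0.0)  # Dense layer + ReLU activation

# Layer sizes from the description: flatten(8000) -> 128 -> 128 -> 128 -> 64 -> 64
sizes = [20 * 400, 128, 128, 128, 64, 64]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def embed(sample, train=False, drop=0.1):
    x = sample.ravel()                       # Flatten layer: (20, 400) -> (8000,)
    for i, (w, b) in enumerate(params):
        x = dense_relu(x, w, b)
        last = (i == len(params) - 1)
        if train and not last:               # Dropout 0.1 after each hidden layer
            x = x * (rng.random(x.shape) > drop) / (1 - drop)
    return x

out = embed(rng.standard_normal((20, 400)))
print(out.shape)  # 64-d embedding fed to the energy function
```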
      <p>
        The Euclidean distance, defined as follows, was used as the energy function in the
Siamese network: [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
      </p>
      <p>d(x, y) = sqrt( Σ_{i=1}^{n} (a_i(x) − a_i(y))² ) (3)</p>
      <p>Fig. 3. Full diagram of the Siamese network used in the experiment</p>
      <p>
        The contrastive loss function was used as the loss function during model training.
The contrastive loss function is based on learning the parameters of
a parametrized function in such a way that neighbors are pulled together and
non-neighbors are pushed apart. Prior knowledge can be used to identify the
neighbors of each training data point [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The contrastive loss function
is defined by the following formula [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]:
      </p>
      <p>L = Y·E² + (1 − Y)·max(margin − E, 0)² (4)</p>
      <p>
where:
- L denotes the contrastive loss function
- Y denotes the expected model prediction
- E denotes the energy function
- margin denotes a loss-function parameter: the threshold for
classifying the distance calculated by the energy function as similarity
      </p>
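      <p>Formula (4) translates directly into code. A minimal sketch with margin = 1.0 (an assumed default, not a value stated in the paper):</p>

```python
import numpy as np

def contrastive_loss(y, e, margin=1.0):
    """Contrastive loss from formula (4): similar pairs (y = 1) are pulled
    together, dissimilar pairs (y = 0) are pushed at least `margin` apart."""
    return y * e**2 + (1 - y) * np.maximum(margin - e, 0.0)**2

# Similar pair at zero distance: no loss.
# Dissimilar pair at zero distance: the full margin penalty.
print(contrastive_loss(1, 0.0), contrastive_loss(0, 0.0))  # → 0.0 1.0
```

      <p>Note that for a dissimilar pair the loss vanishes once the energy exceeds the margin, so the network stops pushing pairs that are already far enough apart.</p>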
    </sec>
    <sec id="sec-7">
      <title>Model accuracy test</title>
      <p>The Siamese neural network model was trained for 21 epochs with
75% of the data in the training subset and 25% of the data in the validation
subset. In order to verify the accuracy of the classification on a small dataset, a
5-fold cross-validation test was performed.</p>
      <sec id="sec-7-1">
        <title>Cross Validation</title>
        <p>
          k-fold cross validation is based on dividing the dataset into k separate subsets,
and then repeatedly training the model on k−1 subsets and checking
the accuracy on the one remaining test subset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The test subsets have to be unique. The
average accuracy over all k tests and its standard deviation are the result of the test [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
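        <p>The k-fold procedure above can be sketched as follows. The model fitting is stubbed out and the per-fold accuracies at the end are purely illustrative values, not results from the paper:</p>

```python
import numpy as np

def k_fold_split(n, k=5, seed=0):
    """Split n sample indices into k disjoint folds, as described above."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = k_fold_split(100, k=5)

# Each round trains on k-1 folds and tests on the remaining one.
for i, test_fold in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert set(train_idx.tolist()).isdisjoint(test_fold.tolist())
    # ... fit the Siamese model on train_idx, score it on test_fold ...

# The reported result is the mean and standard deviation over the k test
# scores, e.g. for hypothetical per-fold accuracies:
fold_accs = np.array([0.84, 0.88, 0.81, 0.86, 0.87])  # illustrative values
print(fold_accs.mean(), fold_accs.std())
```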
      </sec>
      <sec id="sec-7-2">
        <title>The accuracy obtained by the model</title>
        <p>After applying the 5-fold cross-validation test, the model obtained the results shown
in Table 3.</p>
        <p>Table 3. Scores obtained by the model:
Subset | Accuracy | Standard deviation
training | 0.902655 | 0.044054
test | 0.85054 | 0.084436</p>
        <p>The accuracy obtained by the model may indicate overfitting of the model,
which in this case (a small number of samples in the dataset) may be acceptable.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>The key aim of this article was to present a proposal for recognizing short
sound samples as percussion instruments or melodic instruments based on a
Siamese neural network.</p>
      <p>At the beginning of the article the Reader was familiarized with basic
concepts of the meta-learning approach and of sound processing. Then the architecture of
Siamese neural networks was presented, which was later used in the
experiment. Next, the preparations for the experiment were described, namely
the way the dataset was created, together with an explanation of the
classification model. The suggested solution obtained a satisfactory effectiveness,
confirmed by the 5-fold cross-validation test: 85% accuracy.</p>
      <p>The proposed solution is the start of work on a method for recognizing percussion
instruments in full sound tracks. That method will aim at creating
musical notation for the percussion instruments in any recording (if they are present).
In the future it is planned to create sound classification models on the basis of other
meta-learning methods and to compare their effectiveness with each other. Based
on the effectiveness of these models, further work will be carried out towards
the planned objective.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Artiemjew</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Wybrane paradygmaty sztucznej inteligencji</article-title>
          .
          <source>PJATK Publishing House</source>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          .
          <source>Praca z językiem Python i biblioteką Keras. Helion Publishing House</source>
          , (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Dimensionality Reduction by Learning an Invariant Mapping</article-title>
          , http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jankowski</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabczewski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>Meta-Learning in Computational Intelligence</source>
          . Springer Verlag, (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Knopov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Periodogram estimates in nonlinear regression models with long-range dependent noise</article-title>
          .
          <source>Cybernetics and Systems Analysis</source>
          ,
          <year>2013</year>
          , Vol.
          <volume>49</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>624</fpage>
          -
          <lpage>631</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Krzywicki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Weather and a Part of Day Recognition in the Photos Using a kNN Methodology</article-title>
          .
          <source>Technical Sciences</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          )
          <year>2018</year>
          , p.
          <fpage>291</fpage>
          -
          <lpage>302</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <article-title>Mel Frequency Cepstral Coefficient (MFCC) tutorial</article-title>
          : http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ravichandiran</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Hands-On Meta Learning with Python</article-title>
          .
          <source>Packt Publishing</source>
          , (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bali</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Hands-On Transfer Learning with Python</article-title>
          .
          <source>Packt Publishing</source>
          , (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>