<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Application of meta-learning methods in the recognition of drums and cymbals on the basis of short sound samples</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomasz Krzywicki</string-name>
          <email>tomasz.krzywicki@student.uwm.edu.pl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Faculty of Mathematics and Computer Science University of Warmia and Mazury in Olsztyn Poland</institution>
        </aff>
      </contrib-group>
      <abstract>
<p>This article presents a proposal for applying a Siamese neural network to classify short music-instrument sound samples as percussion or non-percussion instruments. In the learning process, 15 sound samples representing each decision class were used. The accuracy of the solution was verified by a 5-fold cross-validation test. The proposed solution achieved a satisfactory score.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Classification of sound files on the basis of sound is difficult. Today's popular
classification methods, such as deep neural networks, require large
numbers of learning examples to achieve satisfactory scores. Meta-learning [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
and transfer-learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] methods may be useful for small sets of learning
examples.
      </p>
      <p>The motivation for the proposed method was an attempt to use
meta-learning methods to classify short music-instrument
samples as either percussion instruments belonging to a basic drum kit or other
instruments. To simplify the creation of the learning dataset, the
solution should work correctly with a small number of samples. The solution is
based on the Siamese neural network architecture, which classifies a sound as
a percussion or non-percussion instrument on the basis of two parallel inputs
containing samples of music-instrument sounds. The proposed solution achieved a
satisfactory score.</p>
      <p>Sections 2 and 3 familiarize the Reader with the basic concepts of the
meta-learning approach and the most common sound processing method, MFCC.
Section 4 describes the architecture of the Siamese neural network that
was used in the experiment. Section 5 details the preparations for
the experiment, i.e. the way the dataset was created. Section 6 familiarizes the Reader
with the details of the classification model used in the experiment.
Section 7 presents the test of the model and the accuracy the model
obtained, and Section 8 summarizes the
experiment carried out and provides information on planned future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Meta learning</title>
      <p>
        Learning and meta-learning methods are used to extract knowledge from
data. Let the learning process of a learning machine L be defined by a function
A(L): [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>A(L) : K_L × D → M (1)</p>
      <p>
        where:
- K_L denotes the space of configuration parameters of the given learning machine L
- D denotes the space of data streams (typically a decision system [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ])
- M denotes the space of goal models
      </p>
      <p>
        Meta-learning is a specific kind of learning method. In the case of
meta-learning, the learning phase learns how to learn, i.e. how to learn as well as possible.
In other words, the target model of meta-learning (the output of meta-learning)
is a configuration of a learning model extracted by the meta-learning algorithm. The
configuration produced by the meta-learning method should play the goal role (e.g.
classifier or regressor) of the meta-learning task. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
      </p>
      <p>
        Meta-learning can be classified in a few ways, ranging from finding optimal
sets of weights to learning the optimizer. Currently, the term meta-learning
covers the following categories: [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
- Learning the metric space
- Learning the initializations
- Learning the optimizer
      </p>
      <sec id="sec-2-1">
        <title>Learning the metric space</title>
        <p>
          Metric-based meta-learning is based on learning an appropriate metric
space. For example, the process can be used to learn the similarity between two sentences.
This approach is widely used in few-shot learning, where the learning
dataset has a small number of samples in each decision class. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] The method
of learning the metric space was also used in the proposed solution.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Learning the initializations</title>
        <p>
          Learning the initializations is based on trying to learn optimal initial
parameter values. The classical learning approach (for example, for a neural network) is
based on initializing random parameters, calculating the loss and minimizing the loss
through gradient descent in order to find optimal parameters. The meta-learning
approach is based on starting from parameter values close to the optimal
ones in order to learn the model very fast. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Learning the optimizer</title>
        <p>
          This method is based on learning the optimizer itself. In the case of few-shot learning,
gradient descent fails when the training set has too few objects, so
the optimizer should itself be learned. In other words, there are two networks: a base
network that actually tries to learn and a meta network that optimizes the base
network. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Sound processing</title>
      <p>
        The first step in any automatic sound recognition is to extract features, i.e.
to identify the components of the audio signal that are good for identifying
the linguistic content while discarding everything else that carries information
such as background noise. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
      </p>
      <p>
        Mel Frequency Cepstral Coefficients (MFCCs) are features widely used in
automatic speech and speaker recognition. They were introduced by Davis and
Mermelstein in the 1980s and have been state-of-the-art ever since. Prior to
the introduction of MFCCs, Linear Prediction Coefficients (LPCs) and Linear
Prediction Cepstral Coefficients (LPCCs) were the main feature types for
automatic speech recognition (ASR), especially with HMM (Hidden Markov
Model) classifiers. The procedure for converting a sound spectrum into numerical
vectors by the MFCC method is as follows: [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
- Frame the signal into short frames
- For each frame, calculate the periodogram estimate [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of the power spectrum
- Apply the mel filterbank to the power spectra and sum the energy in each filter
- Take the logarithm of all filterbank energies
- Take the DCT of the log filterbank energies
- Keep DCT coefficients 2-13, discard the rest
      </p>
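      <p>The steps above can be sketched in plain NumPy. This is an illustrative toy, not the implementation used in the paper; the frame length, hop size, FFT size and number of mel filters are assumed values:</p>

```python
import numpy as np

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_keep=12):
    """Toy MFCC pipeline following the steps above (illustrative only)."""
    # 1. Frame the signal into short frames
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i*hop : i*hop + frame_len] for i in range(n_frames)])
    # 2. Periodogram estimate of the power spectrum for each windowed frame
    nfft = 512
    power = np.abs(np.fft.rfft(frames * np.hamming(frame_len), nfft))**2 / nfft
    # 3. Mel filterbank: triangular filters spaced evenly on the mel scale
    def hz2mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel2hz(m): return 700.0 * (10**(m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz2mel(0), hz2mel(sr / 2), n_mels + 2)
    bins = np.floor((nfft + 1) * mel2hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, nfft // 2 + 1))
    for j in range(n_mels):
        lo, c, hi = bins[j], bins[j + 1], bins[j + 2]
        fbank[j, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[j, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    # 4. Logarithm of the filterbank energies
    log_e = np.log(power @ fbank.T + 1e-10)
    # 5. DCT-II of the log energies; 6. keep coefficients 2-13, discard the rest
    k = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mels), 2 * k + 1) / (2 * n_mels))
    return (log_e @ dct.T)[:, 1:1 + n_keep]

coeffs = mfcc(np.random.randn(16000))
print(coeffs.shape)  # one row of 12 coefficients per frame
```

      <p>Library implementations (e.g. librosa) add pre-emphasis, liftering and other refinements omitted here.</p>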
    </sec>
    <sec id="sec-4">
      <title>Architecture of siamese networks</title>
      <p>
        A Siamese network is a special type of neural network most popularly used in
one-shot learning algorithms, so a Siamese network is predominantly used in
applications with a small number of learning objects. Siamese networks basically
consist of two symmetrical neural networks, both sharing the same weights and
architecture, joined together at the end by some energy function E. The
objective of a Siamese network is to learn a metric space of similarity between two objects,
for example two sound samples. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
      </p>
      <p>As Figure 1 shows, the input of the Siamese network receives two
samples (sample a, sample b) in the form of tensors. The samples are then processed
by each of the twin networks, and their outputs are forwarded to an energy function
which calculates the similarity (metric distance) of the two samples.</p>
      <p>
        Siamese neural networks are commonly used not only for sound recognition.
They are also used for face recognition, signature verification, object tracking,
similar-question retrieval and more. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
      </p>
      <sec id="sec-4-1">
        <title>Detailed architecture of siamese neural networks</title>
        <p>The detailed architecture of the Siamese neural network is shown in Figure 2.</p>
        <p>
          There are two inputs: sample a and sample b. Inputs sample a and sample b
are forwarded to networkA and networkB respectively. The network
outputs are defined by the formula f_w(sample_x), where sample_x denotes the input of the
appropriate neural network. The outputs of the networks are then forwarded to an energy
function E, represented by the formula: [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
        </p>
        <p>E_w(sample a, sample b) = ||f_w(sample a) − f_w(sample b)|| (2)</p>
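        <p>The twin forward pass and energy function above can be sketched in a few lines of NumPy. The shared weight matrix and single-layer embedding are illustrative placeholders; the actual network used in the experiment is described in Section 6:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Shared ("twin") embedding f_w: both inputs pass through the SAME weights.
W = rng.standard_normal((20 * 400, 64)) * 0.01

def f_w(sample):
    # Flatten the (20, 400) MFCC tensor and project it into a 64-d embedding.
    return np.maximum(sample.ravel() @ W, 0.0)  # ReLU

def energy(sample_a, sample_b):
    # E_w = ||f_w(a) - f_w(b)||: distance in the learned metric space.
    return np.linalg.norm(f_w(sample_a) - f_w(sample_b))

a = rng.standard_normal((20, 400))
b = rng.standard_normal((20, 400))
print(energy(a, a))  # identical inputs: distance 0.0
print(energy(a, b))  # different inputs: positive distance
```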
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Dataset preparing</title>
      <p>6 melodic instruments and 6 percussion instruments were used in the experiment.
The group of melodic instruments consists of brass instruments, sound synthesizers,
flute, guitar, organs and piano. The group of percussion instruments consists of the basic
version of a drum set: crash cymbal, hi-hat cymbal, kick drum, ride cymbal,
snare drum and tom drums. Each group of instruments contains both acoustic and
electronic sound samples. For each instrument there are 15 sound samples.</p>
      <p>The aim of this paper is to use a Siamese neural network to evaluate the
similarity of sounds of groups of instruments for the detection of percussion instruments.
Therefore, the data collection is further processed into a
decision system with the following attributes: (sample a, sample b, similarity),
where:
- sample a denotes the tensor of the first sound sample after processing by the MFCC method
- sample b denotes the tensor of the second sound sample after processing by the MFCC method
- similarity denotes the decision attribute, which classifies the similarity of the two sound samples</p>
      <p>A detailed overview of the sample collection for further processing is
presented in Table 1.</p>
      <p>Table 1. Instruments used in the experiment: brass instruments, crash cymbal,
sound synthesizer, flute, guitar, hi-hat cymbal, kick drum, organs, piano,
ride cymbal, snare drum, tom drums</p>
      <p>Two percussion instruments, for example a ride cymbal and a snare drum, may be
considered similar sound samples. A percussion instrument paired with a non-percussion
instrument, for example a kick drum with a piano, may be considered dissimilar sound
samples. The procedure for the selection of similar and dissimilar sound
samples is as follows:
1. For each instrument of the percussion instrument group:
(a) If the iteration is even, draw another melodic instrument and one of its samples.
Add both tensors of samples to the decision system with label 0, which
means dissimilar sounds.
(b) If the iteration is odd, draw another percussion instrument and one of its samples.
Add both tensors of samples to the decision system with label 1, which
means similar sounds.</p>
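      <p>The pairing procedure above can be sketched as follows. This is an illustrative reconstruction; the random draws, seed and the pairing of labels with even/odd iterations follow the description above, not published code:</p>

```python
import random

percussion = ["crash cymbal", "hi-hat cymbal", "kick drum",
              "ride cymbal", "snare drum", "tom drums"]
melodic = ["brass instruments", "sound synthesizer", "flute",
           "guitar", "organs", "piano"]

def build_pairs(seed=0):
    """Even iterations pair a drum with a melodic instrument (label 0,
    dissimilar); odd iterations pair it with another drum (label 1, similar)."""
    rng = random.Random(seed)
    rows = []
    for i, drum in enumerate(percussion):
        if i % 2 == 0:
            other = rng.choice(melodic)
            rows.append((drum, other, 0))   # dissimilar sounds
        else:
            other = rng.choice([p for p in percussion if p != drum])
            rows.append((drum, other, 1))   # similar sounds
    return rows

for row in build_pairs():
    print(row)
```

      <p>In the real decision system each entry holds the MFCC tensors of the drawn samples rather than instrument names.</p>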
      <p>An exemplary decision system based on this procedure is presented in Table 2.</p>
      <p>In order to maintain compatibility between the sound samples of the music
instruments, each sound-sample tensor has been reduced to shape (20, 400).
Depending on the sampling of the sound sample, this may mean a slight difference in the
processed sound. However, it does not affect the quality of the classification.</p>
      <p>Table 2. Exemplary decision system of similarity and dissimilarity of samples of sounds
of music instruments:
sample a | sample b
[[-299.0982, ..., -54.3453]] (kick drum) | [[312.765, ..., 43.8856]] (ride cymbal)
[[19.0841, ..., 88.5388]] (hi-hat cymbal) | [[99.0098, ..., 64.9856]] (piano)
[[24.0991, ..., 75.5542]] (crash cymbal) | [[246.0558, ..., 98.5436]] (snare drum)
[[-132.0841, ..., 45.6430]] (tom drum) | [[199.7355, ..., 99.1432]] (brass instruments)</p>
    </sec>
    <sec id="sec-6">
      <title>Classi cation model</title>
      <p>The Siamese neural network model was used to classify the similarity of two
sounds of music instruments. A single neural network (cloned for the construction
of the Siamese network) was constructed as follows:
- input: a (20, 400) shaped tensor containing the vectorized spectrum of the sound of a
music instrument
- hidden layers:
  - Flatten layer
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 128 with ReLU activation function
  - Dropout layer with factor 0.1
  - Dense layer of size 64 with ReLU activation function
  - Dropout layer with factor 0.1
- output: Dense layer of size 64 with ReLU activation function</p>
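      <p>The layer stack described above can be traced shape-by-shape with a small NumPy forward pass. The random weights and the inverted-dropout formulation are illustrative assumptions; only the layer sizes come from the description above:</p>

```python
import numpy as np

rng = np.random.default_rng(42)

def dense_relu(x, w, b):
    return np.maximum(x @ w + b, 0.0)  # Dense layer + ReLU activation

# Layer sizes from the description: flatten(8000) -> 128 -> 128 -> 128 -> 64 -> 64
sizes = [20 * 400, 128, 128, 128, 64, 64]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def embed(sample, train=False, drop=0.1):
    x = sample.ravel()                       # Flatten layer: (20, 400) -> (8000,)
    for i, (w, b) in enumerate(params):
        x = dense_relu(x, w, b)
        last = (i == len(params) - 1)
        if train and not last:               # Dropout 0.1 after each hidden layer
            x = x * (rng.random(x.shape) > drop) / (1 - drop)
    return x

out = embed(rng.standard_normal((20, 400)))
print(out.shape)  # 64-d embedding fed to the energy function
```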
      <p>
        The Euclidean distance, defined as follows, was used as the energy function in the
Siamese network: [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
      </p>
      <p>d(x, y) = sqrt( Σ_{i=1}^{n} (a_i(x) − a_i(y))² ) (3)</p>
      <p>Fig. 3. Full diagram of the Siamese network used in the experiment</p>
      <p>
        The contrastive loss function was used as the loss function during model training.
The contrastive loss function is based on learning the parameters of
a parametrized function in such a way that neighbors are pulled together and
non-neighbors are pushed apart. Prior knowledge can be used to identify the
neighbors of each training data point [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The contrastive loss function
is defined by the following formula [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]:
      </p>
      <p>L = Y·E² + (1 − Y)·max(margin − E, 0)² (4)</p>
      <p>
where:
- L denotes the contrastive loss function
- Y denotes the expected model prediction
- E denotes the energy function
- margin denotes a loss-function parameter: the threshold for
classifying the distance calculated by the energy function as similarity
      </p>
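      <p>Formula (4) translates directly into code. A minimal sketch with margin = 1.0 (an assumed default, not a value stated in the paper):</p>

```python
import numpy as np

def contrastive_loss(y, e, margin=1.0):
    """Contrastive loss from formula (4): similar pairs (y = 1) are pulled
    together, dissimilar pairs (y = 0) are pushed at least `margin` apart."""
    return y * e**2 + (1 - y) * np.maximum(margin - e, 0.0)**2

# Similar pair at zero distance: no loss.
# Dissimilar pair at zero distance: the full margin penalty.
print(contrastive_loss(1, 0.0), contrastive_loss(0, 0.0))  # → 0.0 1.0
```

      <p>Note that for a dissimilar pair the loss vanishes once the energy exceeds the margin, so the network stops pushing pairs that are already far enough apart.</p>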
    </sec>
    <sec id="sec-7">
      <title>Model accuracy test</title>
      <p>The Siamese neural network model was trained for 21 epochs with
75% of the data in the training subset and 25% of the data in the validation
subset. In order to verify the accuracy of the classification on a small dataset, a
5-fold cross-validation test was performed.</p>
      <sec id="sec-7-1">
        <title>Cross Validation</title>
        <p>
          k-fold cross validation is based on dividing the dataset into k separate subsets,
and then repeatedly training the model on k−1 subsets and checking
the accuracy on the one remaining test subset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. The test subsets have to be unique. The
average accuracy over all k tests and its standard deviation are the result of the test [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
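        <p>The k-fold procedure above can be sketched as follows. The model fitting is stubbed out and the per-fold accuracies at the end are purely illustrative values, not results from the paper:</p>

```python
import numpy as np

def k_fold_split(n, k=5, seed=0):
    """Split n sample indices into k disjoint folds, as described above."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = k_fold_split(100, k=5)

# Each round trains on k-1 folds and tests on the remaining one.
for i, test_fold in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    assert set(train_idx.tolist()).isdisjoint(test_fold.tolist())
    # ... fit the Siamese model on train_idx, score it on test_fold ...

# The reported result is the mean and standard deviation over the k test
# scores, e.g. for hypothetical per-fold accuracies:
fold_accs = np.array([0.84, 0.88, 0.81, 0.86, 0.87])  # illustrative values
print(fold_accs.mean(), fold_accs.std())
```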
      </sec>
      <sec id="sec-7-2">
        <title>The accuracy obtained by the model</title>
        <p>After applying the 5-fold cross-validation test, the model obtained the results shown
in Table 3.</p>
        <p>Table 3. Scores obtained by the model:
Subset | Accuracy | Standard deviation
training | 0.902655 | 0.044054
test | 0.85054 | 0.084436</p>
        <p>The accuracy obtained by the model may indicate overfitting of the model,
which in this case (a small number of samples in the dataset) may be acceptable.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Conclusions</title>
      <p>The key aim of this article was to present a proposal for recognizing short
sound samples as percussion instruments or melodic instruments based on a
Siamese neural network.</p>
      <p>At the beginning of the article the Reader was familiarized with basic
concepts of the meta-learning approach and of sound processing. Then the architecture of
Siamese neural networks was presented, which was later used in the
experiment. Next, the preparations for the experiment were described, namely
the way the dataset was created, together with an explanation of the
classification model. The suggested solution obtained a satisfactory effectiveness,
confirmed by the 5-fold cross-validation test: 85% accuracy.</p>
      <p>The proposed solution is the start of work on a method for recognizing percussion
instruments in full sound tracks. That method will aim at creating
musical notation for the percussion instruments in any recording (if they are present).
In the future it is planned to create sound classification models on the basis of other
meta-learning methods and to compare their effectiveness with each other. Based
on the effectiveness of these models, further work will be carried out towards
the planned objective.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Artiemjew</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Wybrane paradygmaty sztucznej inteligencji</article-title>
          .
          <source>PJATK Publishing House</source>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chollet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          .
          <source>Praca z językiem Python i biblioteką Keras. Helion Publishing House</source>
          , (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hadsell</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chopra</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LeCun</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Dimensionality Reduction by Learning an Invariant Mapping</article-title>
          , http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jankowski</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duch</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grabczewski</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <source>Meta-Learning in Computational Intelligence</source>
          . Springer Verlag, (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Knopov</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bila</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Periodogram estimates in nonlinear regression models with long-range dependent noise</article-title>
          .
          <source>Cybernetics and Systems Analysis</source>
          ,
          <year>2013</year>
          , Vol.
          <volume>49</volume>
          (
          <issue>4</issue>
          ), pp.
          <fpage>624</fpage>
          -
          <lpage>631</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Krzywicki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Weather and a Part of Day Recognition in the Photos Using a kNN Methodology</article-title>
          .
          <source>Technical Sciences</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          )
          <year>2018</year>
          , p.
          <fpage>291</fpage>
          -
          <lpage>302</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <article-title>Mel Frequency Cepstral Coefficient (MFCC) tutorial</article-title>
          : http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ravichandiran</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Hands-On Meta Learning with Python</article-title>
          .
          <source>Packt Publishing</source>
          , (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sarkar</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bali</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , et al.:
          <article-title>Hands-On Transfer Learning with Python</article-title>
          .
          <source>Packt Publishing</source>
          , (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>