<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1186/s40649-019-0069-y</article-id>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tadas Turskis</string-name>
          <email>tadas.turskis@ktu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marius Teleiša</string-name>
          <email>marius.teleisa@ktu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rūta Buckiūnaitė</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dalia Čalnerytė</string-name>
          <email>dalia.calneryte@ktu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CEUR Workshop Proceedings</institution>
          ,
          <addr-line>CEUR-WS.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kaunas University of Technology</institution>
          ,
          <addr-line>Studentų g. 50, Kaunas, 51368</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kaunas University of Technology</institution>
          ,
          <addr-line>Studentų g. 50, Kaunas, 51368</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Kaunas University of Technology</institution>
          ,
          <addr-line>Studentų g. 50, Kaunas, 51368</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Kaunas University of Technology</institution>
          ,
          <addr-line>Studentų g. 50, Kaunas, 51368</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>6</volume>
      <issue>1</issue>
      <fpage>582</fpage>
      <lpage>589</lpage>
      <abstract>
        <p>The goal of environmental sound classification is to accurately identify and classify sounds in order to provide valuable insights about the environment. The classification task can be solved by training machine learning models, such as convolutional neural networks, on a dataset of labeled sound samples. Because the available datasets in this field are small and the labeling process is time-consuming and expensive, data augmentation has become a popular practice to artificially generate additional data. The purpose of this study is to analyze whether Mixed-Type data augmentations improve classification performance compared to results with no augmentation. Mixed-Type data augmentation methods were evaluated on the ESC-50 and UrbanSound8K datasets for a pretrained ResNet-18 model with extracted mel-frequency cepstral coefficients as feature inputs. Results for both datasets show that data augmentation can improve model performance for certain mixup probabilities and coefficients, but the specific methods and parameters may vary for each dataset and task.</p>
      </abstract>
      <kwd-group>
        <kwd>Environmental sound classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Sound classification is the process of identifying and labelling sounds based on their characteristics,
such as pitch, duration, and timbre. It is a fundamental task in the field of audio signal processing and
has numerous applications, including music information retrieval [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], speech recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and
environmental monitoring [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Improving the efficiency and accuracy of sound classification process
may enable more low-power devices to perform this task, help in mitigation of noise pollution, and
increase the robustness of early-warning systems, such as bee hive health monitoring systems [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        There are various approaches to sound classification, including traditional machine learning
techniques [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and more recent deep learning approaches [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Traditional methods typically involve
extracting hand-crafted features from the audio signal and using them as input to a classifier, such as a
Support Vector Machine (SVM), Decision Tree, K-Nearest Neighbor (KNN), Gaussian Mixture Model (GMM), or Hidden Markov Model (HMM) [
        <xref ref-type="bibr" rid="ref7">7–9</xref>
        ].
      </p>
      <sec id="sec-1-2">
        <p>
          Due to their limited modeling capabilities, which lead to a lack of time and frequency invariance, traditional methods
have been outperformed by deep neural network-based models in classifying environmental sounds [10]. Deep
learning methods involve training a neural network to learn features directly from the raw audio data. A
hybrid approach combining both types of methods has also been explored [11].
        </p>
        <p>Deep learning models for tasks in all audio domains are limited in size and complexity due to the small
size of available datasets [12]. With the possible exception of speech recognition, environmental sound
classification (ESC) suffers from the lack of universal database [13]. Nonetheless ESC tasks are a
popular classification problem to solve; therefore, several public datasets have been created.</p>
        <p>ESC-50 [14] and UrbanSound8K (US8K) [15] are the leading datasets with the best results achieved for ESC
tasks [16]. ESC-50 is a balanced dataset published in 2015 with 2000 recordings for 50 classes of various
environmental sounds. Its subset of 10 classes ESC-10 with 400 recordings is often used as a smaller
version. US8K is a dataset published in 2014 with 8732 sound recordings of diverse city sounds. Other
occasionally used datasets are: CHIME-HOME [17] with 6138 recordings of house indoor sounds;
AudioSet [18] with over 2 million recordings of very diverse 632 classes of sounds; FSD50K [19]
unbalanced dataset with 51197 recordings of various indoor, outdoor and instrument sounds;
SONYCUST [20] with 18510 recordings of New York city sounds. It can be noted that besides the AudioSet
dataset, most of these datasets are of limited size and, as mentioned before, are too small for deep
learning models to be trained properly.</p>
        <p>
          The state-of-the-art accuracies for the ESC-50 and UrbanSound8K (US8K) datasets are 97.15% and
96%, respectively [21, 22]. Both [21, 22] employed deep learning models. Data augmentation techniques
such as time scaling, time inversion, random crop or padding, and random noise were used in [21]. On
the other hand, [22] opted to use various feature extraction methods such as NGCC [23], MFCC [24], GFCC
[25], LFCC, and BFCC. An approach using a simple CNN network without any data augmentation or
signal pre-processing on the US8K dataset has demonstrated a mean accuracy of 89%, which outperforms most
recent solutions [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>In the domain of environmental sound, it has been noted that time-frequency representations are
especially useful as learning features due to the non-stationary and dynamic nature of the sounds [26].
These representations can be grouped into two broad categories: time-domain methods and
frequency-domain methods. Time-domain methods involve computing statistics such as the mean, standard
deviation, skewness, and variance over different time windows of the signal. Other time-domain methods
include calculation of zero-crossing rate, amplitude envelope and the root mean square energy.
Frequency-domain methods include techniques such as the calculation of the Power Spectral Density
(PSD) and the Mel-Frequency Cepstral Coefficients (MFCCs). For an increase in performance, it is
advised to combine several feature extraction methods and types of methods [27].</p>
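        <p>The per-window time-domain statistics mentioned above can be sketched in plain numpy; the window and hop size of 512 mirror the feature-extraction settings reported later in the paper, while the function names and framing helper are illustrative, not from the paper:</p>

```python
import numpy as np

def frame_signal(x, win=512, hop=512):
    """Split a 1-D signal into consecutive fixed-size windows."""
    n = 1 + max(0, (len(x) - win) // hop)
    return np.stack([x[i * hop : i * hop + win] for i in range(n)])

def time_domain_features(x, win=512, hop=512):
    """Per-window mean, standard deviation, zero-crossing rate, and RMS energy."""
    frames = frame_signal(x, win, hop)
    # A zero crossing is a sign change between consecutive samples.
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return frames.mean(axis=1), frames.std(axis=1), zcr, rms
```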
        <p>The combination of suitable audio feature extraction, deep learning methods, and data augmentation has
been shown to boost classification performance [28]. Data augmentation is a widely used
technique in various machine learning tasks, including environmental sound classification, to virtually
enlarge the datasets [29]. Augmentations can be divided into two general categories: image and audio
signal. Image augmentation methods include adding noise, sample pairing [30], cropping, adding filters
(e.g. blur, sharpen) [31]. Audio signal augmentations include random cropping, frequency filtering,
equalized mixture data augmentation [32], tone shifting.</p>
        <p>A recent approach is to use various mixup methods [28] to provide higher prediction accuracy and
robustness. Generally, the process of data augmentation is context and dataset dependent, which requires
expert knowledge to select augmentation methods. The mixup augmentation technique is data-agnostic: it
generates a random mixing coefficient, which is used to produce a new image and label
as a convex combination of two selected images/labels.</p>
        <p>In this paper the proposed augmentation technique is based on the mixed-example data augmentation
methods, which combine multiple examples from the training set to create a new, augmented example.
This technique aims to increase the diversity of the training data, which can lead to better generalization
and improved model performance.</p>
        <p>The rest of this paper is organized as follows. Section 2 presents the materials and methods. Section
3 describes the details about the chosen augmentation methods. The experimental results are shown in
Section 4. The results and future prospects are discussed in Section 5. Finally, the conclusions are
presented in the last Section 6.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and methods</title>
    </sec>
    <sec id="sec-3">
      <title>2.1. Datasets</title>
      <p>The study focuses on two publicly available datasets for ESC: ESC-50 and UrbanSound8K
(US8K). These datasets consist of audio recordings of various indoor and outdoor environmental sounds.
The ESC-50 dataset consists of 2000 sound clips covering animal, natural environment, water,
human-produced (non-speech), household indoor, and city sounds. The UrbanSound8K
dataset consists of 8732 short recordings of the city noises people most commonly complain about.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Classification Model</title>
      <p>ResNet-18 [33] is an 18-layer deep convolutional neural network that has shown strong performance
on a variety of tasks, including image classification and object detection. It is relatively lightweight and
efficient, while still being able to capture complex patterns in the data. Additionally, ResNet-18 has been
pre-trained on a large dataset (ImageNet), which means that it has already learned to recognize a wide
range of features that may be useful for environmental sound classification.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3. Data Pre-processing</title>
      <p>To apply the ResNet-18 model for classification, raw audio recordings were converted to an image
representation of sound as Mel-frequency cepstral coefficients (MFCCs). The scheme of the feature
extraction steps is demonstrated in Figure 1.</p>
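      <p>The MFCC pipeline can be sketched with numpy and scipy. The window/hop size of 512 and the 128 mel bands match the settings reported in Section 4.1; the filterbank construction and the number of retained coefficients are illustrative assumptions, not taken from the paper:</p>

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_mels=128, n_fft=512, sr=44100):
    """Triangular mel filters (a rough sketch of the standard construction)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):                 # rising slope
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                 # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(x, sr=44100, n_fft=512, hop=512, n_mels=128, n_mfcc=20):
    """Hamming window, power spectrum, mel energies, log, then DCT."""
    window = np.hamming(n_fft)
    n = 1 + max(0, (len(x) - n_fft) // hop)
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    logmel = np.log(mel + 1e-10)              # avoid log(0)
    return dct(logmel, type=2, axis=1, norm="ortho")[:, :n_mfcc]
```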
    </sec>
    <sec id="sec-6">
      <title>3. Data Augmentation Methods</title>
    </sec>
    <sec id="sec-7">
      <title>3.1. Gaussian Noise</title>
      <p>Gaussian noise augmentation modifies the original image by adding random values
generated from a normal (Gaussian) distribution with a mean of 0 and a standard deviation equal to 3%
of the absolute minimum value in the matrix of the original spectrogram.</p>
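      <p>A minimal numpy sketch of this augmentation (the function name and random-generator handling are illustrative):</p>

```python
import numpy as np

def add_gaussian_noise(spec, rel_sigma=0.03, rng=None):
    """Add zero-mean Gaussian noise with sigma equal to 3% of the
    spectrogram's absolute minimum value."""
    rng = rng or np.random.default_rng()
    sigma = rel_sigma * np.abs(spec.min())
    return spec + rng.normal(0.0, sigma, size=spec.shape)
```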
    </sec>
    <sec id="sec-8">
      <title>3.2. Mixup</title>
      <p>Mixup constructs a virtual training example X̄ from two examples Xi, Xj drawn at random from the
training data and a mixup coefficient λ ~ U(a; b) [34], as follows:
X̄ = λ·Xi + (1 − λ)·Xj. (1)</p>
      <p>Mixing is done between the data of the same class label. The scheme for mixup method is shown in
Figure 2.</p>
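      <p>Equation (1) with same-class mixing can be sketched as follows; the helper name and the default uniform interval are illustrative, and because both examples share a class label, the label needs no mixing:</p>

```python
import numpy as np

def mixup_same_class(Xi, Xj, a=0.7, b=0.8, rng=None):
    """Eq. (1): convex combination of two same-class spectrograms.
    The label of the result is the shared class label."""
    rng = rng or np.random.default_rng()
    lam = rng.uniform(a, b)                 # mixup coefficient λ ~ U(a; b)
    return lam * Xi + (1.0 - lam) * Xj
```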
      <p>The rest of the augmentation methods are derived from mixup, with a different part of the image mixed.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3. Vertical/horizontal mixup</title>
      <p>This method vertically/horizontally mixes the top fraction of spectrogram image Xi with
the corresponding fraction of image Xj. A cutpoint is generated by multiplying the width/height of the first
image by the mixing coefficient λ; the cutpoint is a pair of row r and column c indices of an image X. The
resulting merged image is created by mixing the first cutpoint rows/columns from both images and
keeping the remaining rows/columns from the first image. The scheme for the horizontal/vertical
mixup method is shown in Figure 3. The outlined part shows the mixed part of the image and the solid
blue part is the original Xi image.</p>
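      <p>A numpy sketch of the horizontal variant (the vertical variant applies the same operation along columns; the function name is illustrative):</p>

```python
import numpy as np

def horizontal_mixup(Xi, Xj, lam):
    """Mix the top rows of Xi with the corresponding rows of Xj;
    keep Xi's remaining rows unchanged."""
    r = int(Xi.shape[0] * lam)              # row cutpoint from coefficient λ
    out = Xi.copy()
    out[:r] = lam * Xi[:r] + (1.0 - lam) * Xj[:r]
    return out
```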
      <p>The random 2x2 mixup method divides the images into 4 quadrants using two randomly generated height and width
cutpoints; each quadrant then contains either the Xi section or a mixup of Xi and Xj with mixing coefficient λ. A
constraint of p = 0.5 on the 2x2 grid has been found to be helpful in preventing the image content from
becoming too long, narrow, or missing [30]. An example of the random 2x2 mixup method is shown in
Figure 4. The outlined part shows the part of the image that is mixed and the solid blue part is the original
Xi image.</p>
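      <p>The random 2x2 variant can be sketched by iterating over the four quadrants (an illustrative implementation, assuming each quadrant is mixed independently with probability p):</p>

```python
import numpy as np

def random_2x2_mixup(Xi, Xj, lam, p=0.5, rng=None):
    """Split the image into 4 quadrants at random cutpoints; each quadrant
    keeps Xi or becomes a lam-mixup of Xi and Xj with probability p."""
    rng = rng or np.random.default_rng()
    h, w = Xi.shape
    r, c = rng.integers(1, h), rng.integers(1, w)   # random cutpoints
    out = Xi.copy()
    for rs, cs in [(slice(0, r), slice(0, c)), (slice(0, r), slice(c, w)),
                   (slice(r, h), slice(0, c)), (slice(r, h), slice(c, w))]:
        if p > rng.random():
            out[rs, cs] = lam * Xi[rs, cs] + (1.0 - lam) * Xj[rs, cs]
    return out
```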
    </sec>
    <sec id="sec-10">
      <title>Random column/row interval</title>
      <p>This method picks a random interval of columns/rows and replaces that part of the Xi spectrogram image
with the mixed columns/rows from Xj. The start and end indices of the
column/row interval to be mixed are generated randomly. The random interval method is similar to the previously
mentioned vertical/horizontal mixup method, with the difference that the random interval does not have to
start from the first column/row. The scheme for the random column/row interval mixup method is shown in
Figure 5. The outlined part shows the part of the image that is mixed and the solid blue part is the original
Xi image.</p>
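      <p>A numpy sketch of the column-interval variant (the row variant swaps the roles of rows and columns; names are illustrative):</p>

```python
import numpy as np

def random_col_interval_mixup(Xi, Xj, lam, rng=None):
    """Replace a random column interval of Xi with a lam-mixup
    of Xi and Xj over that interval."""
    rng = rng or np.random.default_rng()
    w = Xi.shape[1]
    start = rng.integers(0, w - 1)              # random start column
    end = rng.integers(start + 1, w + 1)        # exclusive end, at least 1 column
    out = Xi.copy()
    out[:, start:end] = lam * Xi[:, start:end] + (1.0 - lam) * Xj[:, start:end]
    return out
```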
      <p>This method randomly selects rows/columns of Xj to be mixed in. The probability of
choosing a row/column from Xj is determined by p; for experimental testing we used a p value of 0.5.
The scheme for the random columns/rows mixup method is shown in Figure 6. The outlined part shows the
part of the image that is mixed and the solid blue part is the original Xi image.</p>
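      <p>The per-column variant reduces to an independent Bernoulli mask over columns (an illustrative sketch):</p>

```python
import numpy as np

def random_cols_mixup(Xi, Xj, lam, p=0.5, rng=None):
    """Each column of Xi is, with probability p, replaced by a
    lam-mixup of the corresponding columns of Xi and Xj."""
    rng = rng or np.random.default_rng()
    mask = p > rng.random(Xi.shape[1])      # True means this column is mixed
    out = Xi.copy()
    out[:, mask] = lam * Xi[:, mask] + (1.0 - lam) * Xj[:, mask]
    return out
```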
    </sec>
    <sec id="sec-11">
      <title>4. Experimental Results</title>
    </sec>
    <sec id="sec-12">
      <title>4.1. Experiment setup</title>
      <p>The fixed duration of an audio sample is 5 seconds for ESC-50 and 4 seconds for UrbanSound8K. For
each audio file in the ESC-50 and US8K datasets a log-mel spectrogram is generated. Features are
extracted from all recordings with a Hamming window size of 512, hop length of 512, 128 Mel bands, and a
sampling rate of 44.1 kHz. The resulting spectrograms are padded or truncated to a fixed length. Bootstrap
validation with 5 runs is performed for each dataset using a Stratified Shuffle Split with a 0.25 test set size and a
static random seed of 42. All data is then standardized according to the training set. Augmentation is performed
online (when a data sample is provided to the model during training).</p>
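      <p>The validation scheme described above can be sketched with scikit-learn; the feature matrix here is random placeholder data standing in for the extracted MFCC inputs:</p>

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Placeholder features/labels standing in for the extracted MFCCs.
X = np.random.default_rng(42).standard_normal((100, 16))
y = np.repeat(np.arange(4), 25)

# 5 bootstrap runs with a 0.25 test set size and a static seed of 42.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=42)
for train_idx, test_idx in splitter.split(X, y):
    mu = X[train_idx].mean(axis=0)
    sigma = X[train_idx].std(axis=0) + 1e-8
    X_train = (X[train_idx] - mu) / sigma   # standardize by training statistics
    X_test = (X[test_idx] - mu) / sigma     # reuse the same statistics for test
```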
      <p>Training was performed on ResNet-18 model with weights pre-trained on ImageNet and batch size
of 16. Training was performed for 25 epochs with cases of augmentation probability of 0.3, 0.5, 1.0 and
mixup coefficient generated uniformly in the intervals of (0.2; 0.3), (0.45; 0.55), (0.7; 0.8).</p>
      <p>In total there were 76 distinct method configurations to test: no augmentation, Gaussian noise
with 3 augmentation probabilities, and 8 mixup-based augmentation methods, each with 3 augmentation
probabilities and 3 mixup coefficients (1 + 3 + 8 × 9 = 76).</p>
      <p>Experimental testing was performed on a system provided by the Kaunas University of Technology,
and this process took almost 64 h (~5 min per augmentation configuration for 25 epochs × 5 runs × 76
configurations × 2 datasets). The system specifications are 2x AMD EPYC 7452 32-Core Processor with
2x 256GB RAM and an NVIDIA A100-PCIE-40GB GPU.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2. Results for ESC-50 dataset</title>
      <p>There are many hyperparameters to consider, so the results were gradually filtered according to their
performance. The first hyperparameter to consider is the augmentation probability. Mean results grouped
by probability of augmentation for ESC-50 dataset can be seen in Table 1.</p>
      <p>Looking at mean accuracy, the results for probabilities of 0.3 and 0.5 are slightly higher (0.7%) than with no
augmentation, which is expected, as data augmentation should improve model performance when the dataset
is relatively small. The maximum accuracy of 0.87 was achieved with no augmentation; however, almost
all loss metrics indicate better performance for probabilities of 0.3 and 0.5. On the other hand, a probability
of 1.0 degraded model performance compared to no augmentation in every metric that was
recorded, so results for the probability of 1.0 were not considered in further analysis.</p>
      <p>Grouping results by mixup coefficient presents almost identical results for all values of mixup, as
shown in Table 2; however, upon closer inspection small trends can be seen. A mixup coefficient of
0.7-0.8 means that during mixup 70-80% of the original image is used, and 20-30% of the random image.</p>
      <p>Most metrics show the best results for a mixup coefficient of 0.7-0.8, except for minimum loss and
maximum accuracy; however, the differences might be within error at this sample size, so firm
conclusions about the coefficient's effect on model performance cannot be drawn from this data alone.</p>
      <p>Results show that a mixup coefficient of 0.7-0.8 produces slightly better results for the ESC-50 dataset, which
might suggest that the model prefers to have the main image “dominant” (the image with the higher mixup
coefficient), and not the other way around.</p>
      <p>Finally, mean results for each augmentation method can be seen in Table 3. Mixup augmentation
performs best on almost all metrics except maximum loss and Q3, which suggests that the method
produces less consistent results. All proposed mixup methods performed better than the standard Gaussian noise
augmentation method. Interestingly, all column methods (random column interval, random column,
horizontal mixup) performed better than their row counterparts.</p>
    </sec>
    <sec id="sec-14">
      <title>4.3. Results for UrbanSound8K dataset</title>
      <p>Mean results grouped by probability of augmentation for the UrbanSound8K dataset can be seen in Table 4.</p>
      <p>Applying augmentation with a probability of 1.0 results in the worst loss and accuracy for the model.
The loss value is decreased by 5.6% when using either 0.3 or 0.5 augmentation probability compared to
no augmentation. The highest accuracy is identified for the case with no augmentation applied, although
the difference is insignificant with only a 0.001 increase compared to the best performing result with
augmentation used.</p>
      <p>The results once again indicate that using no augmentation leads to the highest accuracy, although
the difference of only 0.001 is not significant. In contrast, Table 5 shows that using mixup with
coefficients of 0.2-0.3 produces the best results in all loss metrics, resulting in a decrease of 7.2% in
mean loss compared to no augmentation. This suggests that the model has almost identical accuracy but
lower loss, leading to a more robust model.</p>
      <p>Looking at Table 6, random column augmentation performs best overall. As seen in the
results from Table 4 and Table 5, the accuracy with augmentation is still only 0.001 higher than
without augmentation, which is not significant. However, what is clearly seen in the mean loss column of the previous
tables and this table is that augmentation always yields lower loss than no
augmentation. In this column, all augmentation methods performed better than or at least the same as no
augmentation, with the random columns mixup method being the best, giving a
decrease of 14.4% in mean loss compared to no augmentation. This demonstrates that the models trained
with augmentation are more confident in their predictions. In addition, from Table 6 we see that the
difference between the minimum and maximum loss and accuracy is lower for random columns than for no
augmentation (0.029 compared to 0.037, and 0.007 compared to 0.008, respectively), which means
that the trained models are more consistent and stable when augmentation is applied.</p>
    </sec>
    <sec id="sec-15">
      <title>5. Discussion</title>
      <p>Column mixup methods performed better than row methods, which suggests that having complete
frequency data is more important than full temporal data in the ESC problem.</p>
      <p>For ESC testing, mixup coefficients of 0.2-0.3 and 0.7-0.8 were chosen, and one might argue that such
values produce the same set of images. This may be true for an infinite set; however, the image set for a
training epoch is finite, and a 0.7-0.8 mixup coefficient guarantees that the augmented set will always have
one of each sample as the “dominant” image (higher mixup coefficient), whereas with 0.2-0.3 the reverse
is true.</p>
      <p>When comparing the ESC-50 and US8K datasets, we see more improvement in the former; a probable
reason is that ESC-50 has only 40 examples per class, while US8K has up to 1000 examples per
class. It could be useful for future research to apply these augmentations to datasets with a
very low number of samples, where bigger improvements might be seen. For future improvements in
accuracy, a combination of various feature extraction methods could be used with our proposed methods.</p>
    </sec>
    <sec id="sec-16">
      <title>6. Conclusion</title>
      <p>Results from the ESC-50 and UrbanSound8K datasets show that data augmentation can improve
model performance, particularly when using probabilities of 0.3 or 0.5. Mixup augmentation with a
coefficient of 0.7-0.8 was found to produce the best results for the ESC-50 dataset. The random column
augmentation demonstrated the highest accuracy for the UrbanSound8K dataset. It is important to note
that the results for the UrbanSound8K dataset showed lower and less significant improvements compared
to the ESC-50 dataset. It is also worth noting that applying augmentation with a probability of 100%
resulted in the worst loss and accuracy in both datasets. Overall, these results indicate that data
augmentation can be a useful tool for improving model performance, but the specific methods and
parameters used may vary depending on the dataset and task at hand.</p>
    </sec>
    <sec id="sec-17">
      <title>7. References</title>
      <p>[8] GIANNOULIS, Dimitrios, BENETOS, Emmanouil, STOWELL, Dan, ROSSIGNOL, Mathias,
LAGRANGE, Mathieu and PLUMBLEY, Mark D. Detection and classification of acoustic scenes
and events: An IEEE AASP challenge. In : 2013 IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics. Online. New Paltz, NY, USA : IEEE, October 2013. p. 1–4.</p>
      <p>ISBN 978-1-4799-0972-8. DOI 10.1109/WASPAA.2013.6701819.
[9] ZHANG, Haomin, MCLOUGHLIN, Ian and SONG, Yan. Robust sound event recognition using
convolutional neural networks. In : 2015 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP). Online. South Brisbane, Queensland, Australia : IEEE, April 2015.
p. 559–563. ISBN 978-1-4673-6997-8. DOI 10.1109/ICASSP.2015.7178031.
[10] AL-HATTAB, Yousef Abd, ZAKI, Hasan Firdaus and SHAFIE, Amir Akramin. Rethinking
environmental sound classification using convolutional neural networks: optimized parameter
tuning of single feature extraction. Neural Computing and Applications. Online. November 2021.</p>
      <p>Vol. 33, no. 21, p. 14495–14506. DOI 10.1007/s00521-021-06091-7.
[11] ULLO, Silvia Liberata, KHARE, Smith K., BAJAJ, Varun and SINHA, G. R. Hybrid
Computerized Method for Environmental Sound Classification. IEEE Access. 2020. Vol. 8,
p. 124055–124065. DOI 10.1109/ACCESS.2020.3006082.
[12] PURWINS, Hendrik, LI, Bo, VIRTANEN, Tuomas, SCHLUTER, Jan, CHANG, Shuo-Yiin and
SAINATH, Tara. Deep Learning for Audio Signal Processing. IEEE Journal of Selected Topics in
Signal Processing. Online. May 2019. Vol. 13, no. 2, p. 206–219.</p>
      <p>DOI 10.1109/JSTSP.2019.2908700.
[13] DAVIS, Nithya and SURESH, K. Environmental Sound Classification Using Deep Convolutional
Neural Networks and Data Augmentation. In : 2018 IEEE Recent Advances in Intelligent
Computational Systems (RAICS). Online. Thiruvananthapuram, India : IEEE, December 2018.
p. 41–45. ISBN 978-1-5386-7336-2. DOI 10.1109/RAICS.2018.8635051.
[14] PICZAK, Karol J. ESC: Dataset for Environmental Sound Classification. In : Proceedings of the
23rd ACM international conference on Multimedia. Online. Brisbane Australia : ACM, 13 October
2015. p. 1015–1018. ISBN 978-1-4503-3459-4. DOI 10.1145/2733373.2806390.
[15] SALAMON, Justin, JACOBY, Christopher and BELLO, Juan Pablo. A Dataset and Taxonomy for
Urban Sound Research. In : Proceedings of the 22nd ACM international conference on
Multimedia. Online. Orlando Florida USA : ACM, 3 November 2014. p. 1041–1044. ISBN
978-1-4503-3063-3. DOI 10.1145/2647868.2655045.
[16] BANSAL, Anam and GARG, Naresh Kumar. Environmental Sound Classification: A descriptive
review of the literature. Intelligent Systems with Applications. Online. 1 November 2022. Vol. 16,
p. 200115. DOI 10.1016/j.iswa.2022.200115.
[17] FOSTER, Peter, SIGTIA, Siddharth, KRSTULOVIC, Sacha, BARKER, Jon and PLUMBLEY,
Mark D. Chime-home: A dataset for sound source recognition in a domestic environment. In : 2015
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). Online.
New Paltz, NY, USA : IEEE, October 2015. p. 1–5. ISBN 978-1-4799-7450-4.</p>
      <p>DOI 10.1109/WASPAA.2015.7336899.
[18] GEMMEKE, Jort F., ELLIS, Daniel P. W., FREEDMAN, Dylan, JANSEN, Aren, LAWRENCE,
Wade, MOORE, R. Channing, PLAKAL, Manoj and RITTER, Marvin. Audio Set: An ontology
and human-labeled dataset for audio events. In : 2017 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP). Online. New Orleans, LA : IEEE, March
2017. p. 776–780. ISBN 978-1-5090-4117-6. DOI 10.1109/ICASSP.2017.7952261.
[19] FONSECA, Eduardo, FAVORY, Xavier, PONS, Jordi, FONT, Frederic and SERRA, Xavier.</p>
      <p>FSD50K: An Open Dataset of Human-Labeled Sound Events. Online. 23 April 2022. arXiv.
arXiv:2010.00475. arXiv:2010.00475 [cs, eess, stat]
[20] CARTWRIGHT, Mark, CRAMER, Jason, MENDEZ, Ana, WANG, Yu, WU, Ho-Hsiang,
LOSTANLEN, Vincent, FUENTES, Magdalena, DOVE, Graham, MYDLARZ, Charlie,
SALAMON, Justin, NOV, Oded and BELLO, Juan. SONYC-UST-V2: An Urban Sound Tagging
Dataset with Spatiotemporal Context. 2020.
[21] GUZHOV, Andrey, RAUE, Federico, HEES, Jörn and DENGEL, Andreas. AudioCLIP: Extending</p>
      <p>CLIP to Image, Text and Audio. 2021.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>YANG</surname>
          </string-name>
          ,
          <source>Gao. Research on Music Content Recognition and Recommendation Technology Based on Deep Learning. Security and Communication Networks. 14 March</source>
          <year>2022</year>
          . Vol.
          <year>2022</year>
          . DOI 10.1155/
          <year>2022</year>
          /7696840.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>DOMINGUEZ-MORALES</surname>
          </string-name>
          ,
          <string-name>
            <surname>Juan</surname>
            <given-names>P.</given-names>
          </string-name>
          , LIU, Qian,
          <string-name>
            <surname>JAMES</surname>
          </string-name>
          , Robert, GUTIERREZ-GALAN, Daniel, JIMENEZ-FERNANDEZ, Angel,
          <string-name>
            <surname>DAVIDSON</surname>
          </string-name>
          , Simon and
          <string-name>
            <surname>FURBER</surname>
          </string-name>
          , Steve.
          <article-title>Deep Spiking Neural Network model for time-variant signals classification: a real-time speech recognition approach</article-title>
          . In : 2018
          <source>International Joint Conference on Neural Networks (IJCNN)</source>
          .
          <source>July</source>
          <year>2018</year>
          . p.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . DOI 10.1109/IJCNN.
          <year>2018</year>
          .
          <volume>8489381</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>GHANNAM</given-names>
            ,
            <surname>Ryan</surname>
          </string-name>
          <string-name>
            <given-names>B. and TECHTMANN</given-names>
            ,
            <surname>Stephen</surname>
          </string-name>
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring</article-title>
          .
          <source>Computational and Structural Biotechnology Journal. Online</source>
          .
          <year>2021</year>
          . Vol.
          <volume>19</volume>
          , p.
          <fpage>1092</fpage>
          -
          <lpage>1107</lpage>
          . DOI 10.1016/j.csbj.
          <year>2021</year>
          .
          <volume>01</volume>
          .028.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>SOARES</surname>
            ,
            <given-names>Bianca Sousa</given-names>
          </string-name>
          ,
          <string-name>
            <surname>LUZ</surname>
            ,
            <given-names>Jederson Sousa</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DE MACÊDO</surname>
            ,
            <given-names>Valderlândia Francisca</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SILVA</surname>
            ,
            <given-names>Romuere Rodrigues Veloso e</given-names>
          </string-name>
          ,
          <string-name>
            <surname>DE ARAÚJO</surname>
            ,
            <given-names>Flávio Henrique Duarte</given-names>
          </string-name>
          and
          <string-name>
            <surname>MAGALHÃES</surname>
            ,
            <given-names>Deborah Maria Vieira</given-names>
          </string-name>
          .
          <article-title>MFCC-based descriptor for bee queen presence detection</article-title>
          .
          <source>Expert Systems with Applications</source>
          . Online. 1 September
          <year>2022</year>
          . Vol.
          <volume>201</volume>
          , p.
          <fpage>117104</fpage>
          . DOI 10.1016/j.eswa.2022.117104.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>EKPEZU</surname>
            ,
            <given-names>Akon O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>KATSRIKU</surname>
            ,
            <given-names>Ferdinand</given-names>
          </string-name>
          ,
          <string-name>
            <surname>YAOKUMAH</surname>
            ,
            <given-names>Winfred</given-names>
          </string-name>
          and
          <string-name>
            <surname>WIAFE</surname>
            ,
            <given-names>Isaac</given-names>
          </string-name>
          .
          <article-title>The Use of Machine Learning Algorithms in the Classification of Sound: A Systematic Review</article-title>
          . Online. DOI 10.4018/IJSSMET.298667. https://services.igi-global.com/resolvedoi/resolve.aspx?doi=10.4018/IJSSMET.298667.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>ABDOLI</surname>
            ,
            <given-names>Sajjad</given-names>
          </string-name>
          ,
          <string-name>
            <surname>CARDINAL</surname>
            ,
            <given-names>Patrick</given-names>
          </string-name>
          and
          <string-name>
            <surname>LAMEIRAS KOERICH</surname>
            ,
            <given-names>Alessandro</given-names>
          </string-name>
          .
          <article-title>End-to-end environmental sound classification using a 1D convolutional neural network</article-title>
          .
          <source>Expert Systems with Applications</source>
          . Online. December
          <year>2019</year>
          . Vol.
          <volume>136</volume>
          , p.
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          . DOI 10.1016/j.eswa.2019.06.040.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>ANWAR</surname>
            ,
            <given-names>Muhammad Zohaib</given-names>
          </string-name>
          ,
          <string-name>
            <surname>KALEEM</surname>
            ,
            <given-names>Zeeshan</given-names>
          </string-name>
          and
          <string-name>
            <surname>JAMALIPOUR</surname>
            ,
            <given-names>Abbas</given-names>
          </string-name>
          .
          <article-title>Machine Learning Inspired Sound-Based Amateur Drone Detection for Public Safety Applications</article-title>
          .
          <source>IEEE Transactions on Vehicular Technology</source>
          . Online. March
          <year>2019</year>
          . Vol.
          <volume>68</volume>
          , no.
          <issue>3</issue>
          , p.
          <fpage>2526</fpage>
          -
          <lpage>2534</lpage>
          . DOI 10.1109/TVT.2019.2893615.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>