=Paper= {{Paper |id=Vol-2718/paper04 |storemode=property |title=Comparison of Dimensionality Reduction Techniques on Audio Signals |pdfUrl=https://ceur-ws.org/Vol-2718/paper04.pdf |volume=Vol-2718 |authors=Tamás Pál,Dániel T. Várkonyi |dblpUrl=https://dblp.org/rec/conf/itat/PalV20 }} ==Comparison of Dimensionality Reduction Techniques on Audio Signals== https://ceur-ws.org/Vol-2718/paper04.pdf
       Comparison of Dimensionality Reduction Techniques on Audio Signals

                                                      Tamás Pál, Dániel T. Várkonyi

                    Eötvös Loránd University, Faculty of Informatics, Department of Data Science and Engineering,
                  Telekom Innovation Laboratories, Budapest, Hungary {evwolcheim, varkonyid}@inf.elte.hu
                                           WWW home page: http://t-labs.elte.hu

Abstract: Analysis of audio signals is widely used and                    this work: car horn, dog bark, engine idling, gun shot, and
very effective technique in several domains like health-                  street music [5].
care, transportation, and agriculture. In a general process                  Related work is presented in Section 2, basic mathe-
the output of the feature extraction method results in huge               matical notation used is described in Section 3, while the
number of relevant features which may be difficult to pro-                different methods of the pipeline are briefly presented in
cess. The number of features heavily correlates with the                  Section 4. Section 5 contains data about the evaluation
complexity of the following machine learning method. Di-                  methodology, Section 6 presents the results and conclu-
mensionality reduction methods have been used success-                    sions are formulated in Section 7.
fully in recent times in machine learning to reduce com-                     The goal of this paper is to find a combination of feature
plexity and memory usage and improve speed of following                   extraction and dimensionality reduction methods which
ML algorithms. This paper attempts to compare the state                   can be most efficiently applied to audio data visualization
of the art dimensionality reduction techniques as a build-                in 2D and preserve inter-class relations the most.
ing block of the general process and analyze the usability
of these methods in visualizing large audio datasets.
                                                                          2   Related Work
1    Introduction                                                         A large number of research addresses the problem of map-
                                                                          ping a collection of sounds into a 2-dimensional map and
With recent advances in machine learning, audio, speech,                  examining the results both visually and analytically.
and music processing has evolved greatly, with many ap-                      Roma et al. [2] used MFCC and autoencoders as feature
plications like searching, clustering, classification, and                extraction methods and compared PCA, tSNE, Isomap,
tagging becoming more efficient and robust through ma-                    MDS, SOM and the Fruchterman-Reingold dimensional-
chine learning techniques. Well performing dimension-                     ity reduction to map the output of the feature extraction
ality reduction methods have been successfully applied                    process into 2D.
in many areas of computer science and interdisciplinary                      Fedden [7] used a very similar approach, with only
fields.                                                                   different methods tried. As far as feature extraction is
   A very important phenomenon that justifies the use of                  concerned, MFCCs and an autoencoder-like architecture
such methods is the curse of dimensionality. The ba-                      called NSynth were used. PCA, t-SNE, UMAP methods
sic idea is that the increase of dimensionality increases                 were used for dimensionality reduction.
the volume of space, causing the data to become sparse.                      Hantrakul et al. [8] implements the usual pipeline of
Therefore, the amount of data required for a model to                     applying feature extraction, reshaping, then dimensional-
achieve high accuracy and efficiency increases greatly.                   ity reduction into 2D. STFT, MFCC, high-level features
   Considering audio signals, meaningful features need to                 and the Wavenet encoder were used for feature extrac-
be extracted before applying any dimensionality reduction                 tion. PCA, UMAP and t-SNE were the candidates for the
method. Using methods for feature extraction like Short                   next step. The dataset consisted of different drum sample
Time Fourier Transformation (STFT), Mel-frequency cep-                    sounds. The results were analyzed through external clus-
stral coefficient (MFCC), or high-level descriptors like                  tering metrics like homogeneity, completeness score and
zero-crossing-rate, representative data can be extracted                  V-measure, using k-means as clustering method.
from the audio data. After extraction these features can                     Dupont et al. [11] used MFCCs combined with spec-
be projected into two-dimensional space for cluster analy-                tral flatness to represent the extracted information from the
sis and examination of class separation.                                  audio. PCA, Isomap and t-SNE were chosen for dimen-
   The dataset used is the UrbanSound 8k dataset, contain-                sionality reduction. In addition, supervised dimensional-
ing 8732 labeled sound excerpts (≤ 4 seconds) of urban                    ity reduction methods were included as well, like LDA
sounds from 10 classes, from which 5 have been used in                    (Linear Discriminant Analysis) and HDA (Heteroscedas-
                                                                          tic Discriminant Analysis).
      Copyright c 2020 for this paper by its authors. Use permitted un-
                                                                             Charles Lo’s work [1] is focused on dimensionality re-
der Creative Commons License Attribution 4.0 International (CC BY         duction for music feature extraction. The author used fea-
4.0).                                                                     ture vectors composed of: 13 MFCCs and high-level fea-
tures. Locally Linear Embedding (LLE), Autoencoder, t-                 1. Low: 1 Hz, High: 10000 Hz
SNE and PCA were used to map into lower dimensions.
The dataset contained 1000 song clips equally distributed              2. Low: 25 Hz, High: 9000 Hz
from 10 musical genres. In order to test the results, Gaus-            3. Low: 50 Hz, High: 8500 Hz
sian mixture models were implemented to test cluster pu-
rity and supervised classification was also performed with             4. Low: 100 Hz, High: 8000 Hz
kNN classifier. tSNE achieved the best classification per-
                                                                       5. Low: 150 Hz, High: 7000 Hz
formance.

                                                                      4.2   Feature Extraction
3     Basic Notation
                                                                      In order to extract meaningful data, that can be interpreted
Given a dataset of discrete audio signals, we de-                     by machine learning models, the step of feature extraction
note the i-th recording as x(i) and the dataset as                    is required. This step also acts as a dimensionality reduc-
X = {xx(1) ,xx(2) , ...,xx(N) }. The corresponding ground-            tion process, because it transforms a sound segment repre-
truth labels are stored in Y = {y(1) , y(2) , ..., y(N) }. A single   sented by hundreds of thousands of samples to a data struc-
audio element of the dataset is comprised of a number of              ture containing only a few thousand values or less. This is
                      (i) (i)    (i)
samples: x(i) = (x1 , x2 , ...xM ).                                   a crucial step as it enables to apply dimensionality reduc-
                                                                      tion methods later that will ultimately project the sample
                                                                      into the lowest dimensionality of just 2 components.

4     Methods
                                                                      STFT The Short-time Fourier transform (STFT) cap-
The building blocks of the process used in this work are              tures both temporal and spectral information from a sig-
presented in the following sections and are also shown on             nal. As a result, the STFT transforms the signal from
figure 1. As seen on the figure, a set of discrete audio sig-         time-amplitude representation to time-frequency represen-
nals are transformed trough consecutive methods to result             tation. Formally, the STFT is calculated by applying the
in a 2D map of datapoints.                                            discrete Fourier transform with a sliding window to over-
                                                                      lapping segments of the signal. It can be formulated in the
                                                                      following way:
                                                                                                     L−1                       2π
                                                                        ST FT {xx(i) } := S(i) [k, n] = ∑ x(i) [n + m]w[m]e− jk L m
                                                                                                     m=0
                                                                                                                              (1)
                                                                      where w[m] is the window function, L is the window
                                                                      length, n denotes the frame number and k denotes the fre-
                                                                      quency in question.
                                                                         The STFT returns a complex-valued matrix, where
                                                                      |S[n, k]| is the magnitude of the frequency bin k at time-
                                                                      frame n.

                                                                      Mel-scaled STFT As human perception of pitch (fre-
                Figure 1: Steps of the process                        quency) is logarithmic, the Mel-scale is often applied to
                                                                      the frequency axis of the STFT output. Specifically, the
                                                                      Mel-scale is linearly spaced below 1000 Hz and turns log-
4.1   Band-pass filtering                                             arithmic afterward.
                                                                         The Mel-scaled frequency value can be computed from
Applying filtering of some kind before feature extraction             a frequency( f ) in Hertz using the following equation:
could be justifiable to remove frequency intervals that usu-
ally do not have discriminative power. The simplest such                                                           f
                                                                                   Mel( f ) = 2595 · log10 (1 +       )             (2)
solution is applying a band-pass filter. A band-pass filter                                                       700
passes frequencies between a lower limit fL and a higher
limit fH , while rejecting frequencies outside this interval.         MFCC Mel Frequency Cepstral coefficients are widely
Band-pass filters are usually implemented using Butter-               used in audio and speech recognition, and were introduced
worth filters, as these have maximally flat frequency re-             by Paul Mermelstein in 1976 [10]. Overview of the pro-
sponse in the passband.                                               cess:
   The following low-cutoff and high-cutoff frequency
pairs have been used as band-pass filter specifications:               1. Take the Short-time Fourier transform of the signal.
 2. Apply the Mel filterbank to the power spectra of each               • Spectral bandwidth:
    frame, summing the energies in each frequency band.                                         s
    The Mel filterbank is a set of overlapping triangular                                           ∑K−1              2
                                                                                                     k=0 (k − vSC (n)) · |S[k, n]|
                                                                                                                                  2
    filters.                                                                       vSS (n) =                K−1
                                                                                                                                            (8)
                                                                                                           ∑k=0 |S[k, n]|2
 3. Take the logarithm of the calculated energies.
                                                                             The spectral bandwidth is a measure of the average
 4. Take the Discrete Cosine Transform (DCT) of the log                      spread of the spectrum in relation to its centroid.
    filterbank energies, resulting in a so-called cepstrum.
    Intuitively, the cepstrum captures information of the               • Spectral flatness:
    rate of change in frequency bands. The DCT is re-                                                       q
    lated to the DFT (Discrete Fourier Transform). While                                                    K
                                                                                                              ∏K−1
                                                                                                               k=0 |S[k, n]|
    DFT uses both sine and cosine functions to express a                                   vSF (n) =        1 K−1
                                                                                                                                            (9)
    function as a sum of these sinusoids, the DCT uses                                                      K ∑k=0 |S[k, n]|
    only cosine functions. Formally (using the DCT-II):
                                                                             The spectral flatness is the ratio of geometric mean
              N−1          
                             π      1
                                                                            and arithmetic mean of the magnitude spectrum.
      C[k] = ∑ x[n] · cos      (n + )k ,        1≤k≤N
              n=0            N      2
                                                         (3)          4.3    Reshaping methods
 5. The MFCC coefficients are the resulting DCT coeffi-
                                                                      After the feature extraction step, the feature set takes the
    cients.
                                                                      (N, k, n) form, where N is the size of the dataset, k denotes
                                                                      the number of frequency bins (or features) and n is the
High-level features High level descriptors that capture               number of time frames. To be able to perform dimension-
temporal or spectral information about the signal are often           ality reduction on the whole dataset, the feature set needs
used in similar experiments. The following six features               to be transformed to a (N, d) form, where d is the size of
have been used for frame number n:                                    a feature vector after the reshaping transformation. Three
                                                                      such methods have been found in associated papers:
  • Root mean square:
                                 s
                                     ∑L−1
                                      m=0 x[n + m]
                                                  2
                                                                      Flattening Stacking consecutive rows horizontally to
                   vRMS (n) =                                   (4)
                                           L                          form one, long 1-dimensional vector. Using this method, a
                                                                      feature set of shape (k, n) will take the (1, k × n) form after
     where L is the window length.                                    the reshaping operation. Used by [8].
  • Zero crossing rate:                                                                                                            
                                                                                                        S11      S12   S13     ...
                1 L−1                                                                                  S21      S22   S23     . . .
                                                                                                                                     =⇒
     vZC (n) =     ∑ |sgn(x[n + m]) − sgn(x[n + m − 1])|                                                 ..
                                                                                                       
               2L m=0                                                                                                  ..
                                                                                                          .               .
                                                       (5)                                                                             
                                                                             S11   S12    S13     . . . S1N     S21    S22     S23   ...
     where L is the window length. The zero crossing rate
     is the number of changes of sign in consecutive audio
     samples.                                                         Aggregation of features over frames: mean, std By
                                                                      calculating some statistical features across all frames, the
  • Spectral centroid:
                                                                      spectrogram can be reduced to one feature vector, where
                                 ∑K−1
                                  k=0 k · |S[k, n]|
                                                   2                  statistical values (µS - mean of random variable, σS - stan-
                    vSC (n) =                                   (6)   dard deviation of random variable) calculated for each fre-
                                  ∑K−1
                                   k=0 |S[k, n]|
                                                 2
                                                                      quency band are stacked vertically in a continuous fashion.
     The spectral centroid indicates where the center of              Using this method, a feature set of shape (k, n) will take
     mass of the spectrum is located.                                 the (k × A, 1) form after the reshaping operation, where A
                                                                      is the number of statistics calculated. Used by [2], [7],
  • Spectral rolloff:                                                 [11].
                                                                                                                        
                 vSR (n) = i                                    (7)                                                     µS1
                               ∑ik=0 |S[k,n]|=κ ∑K−1
                                                                                                                 
                                                 k=0 |S[k,n]|                       S11     S12      S13     ...       σS1 
                                                                                                                        
                                                                                   S21     S22      S23     . . .
                                                                                                                   =⇒ µS2 
                                                                                                                        
                                                                                                                                        (10)
     Spectral rolloff is the frequency below which a spec-
                                                                                     ..
                                                                                   
                                                                                                     ..                σS2 
     ified percentage of the total spectral energy lies, with                         .                 .
                                                                                                                         ..
                                                                                                                        
     κ denoting the percentage score.                                                                                     .
Aggregation of features over frames: mean, std, min,            t-Stochastic Neighbor Embedding (TSNE) t-SNE is a
max Similar to the solution above, but with additional          nonlinear dimensionality reduction method developed by
minimum and maximum values calculated for each fre-             Laurens van der Maaten and Geoffrey Hinton in 2008 [9].
quency bin. Used by [1].                                        t-SNE has the ability to preserve local structure by finding
                                                                a distribution that models the neighboring relations of
4.4   Dimensionality Reduction Methods                          the data in the original space and mapping this into
                                                                lower dimensional embedding (ie.: finding a lower dimen-
Once the data is reshaped, dimensionality reduction is ap-      sional distribution) while preserving local spatial relations.
plied in order to project the data into a lower dimensional
space.                                                             Using this probabilistic approach, the first main idea is
                                                                that to convert Euclidean distances into conditional prob-
Principal Component Analysis (PCA) Principal Com-               abilities that reflect these spatial similarities. Let P be the
ponent Analyis is a basis transformation that results in a      joint probability distribution of the data in high dimen-
set of orthogonal basis vectors along which the variance        sional space, represented by a normal distribution:
of the data is maximal.
                                                                                         1                                 kxx(i) −xx( j) k2
                                                                 p j|i =                                          exp(−                      ) (14)
  Calculating PCA involves the following steps:                            ∑k6=i exp(−
                                                                                           kxx(i) −xx(k) k2
                                                                                                              )                  2σi2
                                                                                                    2σi2
 1. Denote the reshaped dataset with X ∈ RN×d .
                                                                Intuitively, this expresses the conditional probability of xi
 2. Standardization: subtract the mean, and divide by the       picking x j as its neighbor if neighbors were chosen in pro-
    standard deviation for each feature column. Each col-       portion to their density around xi .
    umn will a have a mean of 0 and a standard deviation           Student t-distribution (with one degree of freedom) Q is
    of 1.                                                       used to model the distribution of the low dimensional data:
 3. Form the covariance matrix of X.
                                                                                           (1 + ky(i) − y( j) k2 )−1
      The variance-covariance matrix consists of the vari-                     qi| j =                                                       (15)
                                                                                         ∑k6=i (1 + ky(i) − y(k) k2 )−1
      ances of the variables along the main diagonal and
      the covariances between each pair of variables in the        Due to difficult optimization, t-SNE introduces a modi-
      other matrix positions. The covariance matrix is sym-     fied version of the P distribution, a symmetric probability
      metric, thus it’s diagonalizable.                         density function, defined as:
      The covariance matrix is calculated from X using the                                                 pi| j + p j|i
      following formula:                                                                     pi j =                                          (16)
                                                                                                                2n
                       1
                 C=       X T X,       C ∈ Rd×d         (11)       t-SNE uses gradient descent to minimise the error be-
                      n−1                                       tween the two distributions and adds a decaying momen-
 4. Compute the eigenvectors and eigenvalues of the co-         tum term as an optimization. The authors introduce the
    variance matrix. Formally, the eigendecomposition           Kullback-Leibler (KL) divergence as the cost (loss) func-
    of a matrix is described as:                                tion:
                                                                                                        pi j
                                                                             C = KL(P||Q) = ∑ pi j log               (17)
                          C = V DV −1                   (12)                                   i6= j    qi j

      where V is a square matrix whose ith columns repre-       The gradient of the cost function is the following:
      sents the ith eigenvector of C. D is a diagonal matrix
                                                                                  ∂C
      whose diagonal elements represent the eigenvalues of                             = 4 ∑(pi j − qi j )(yi − y j )                        (18)
      C.                                                                          ∂ yi     j

      These eigenvectors are called the principal compo-          In essence, the t-SNE algorithm is based on the follow-
      nents of the dataset, ie. the axis on which the data is   ing steps:
      more dispersed.
                                                                  1. Compute pairwise similarities p j|i .
 5. Sort the eigenvalues and the corresponding eigenvec-
    tors in decreasing order.                                                      pi| j +p j|i
                                                                  2. Let pi j =         2n      .
 6. Leave the first d 0 number of eigenvalues and their cor-
                                  0                               3. Randomly sample the low dimensional points from a
    responding eigenvectors. V = Vi j , 1 ≤ i ≤ d, 1 ≤
         0                                                           Gaussian distribution with small variance (ie.: Y ∼
    j≤d
                                                                     N(0, 10−4 I)).
 7. The transformed dataset will be:
                                                                  4. Repeat the following steps for T number of iterations:
                                   0            0
                   Xlow_dim = XV ,     ∈ RN×d           (13)         for t=1 to T do:
      (a) Compute low-dimensional similarities qi j .          Self Organizing Map (SOM) Self Organizing Maps
      (b) Compute gradient ∂C                                  (SOMs) were introduced by Finnish professor Teuvo Ko-
                           ∂y .
                                                               honen in 1982 [12]. They are a type of Artificial Neural
      (c) Update y(t) = y(t−1) + η ∂C
                                   ∂ y + α(t)(y
                                               (t−1) −
                                                               Network, that learns a two-dimensional representation of
          y(t−2) ), where η is the learning rate, α the        an input data of arbitrary dimension through an unsuper-
          decaying weight coefficient of the momentum          vised learning process. Architecturally, SOMs contain two
          term.                                                types of neurons (nodes): input and map neurons. The
                                                               number of input nodes correspond to the dimensionality
   An important hyperparameter of the t-SNE method             of the data, while the number of map nodes are arbitrar-
is the perplexity. According to the authors, perplexity        ily chosen, and are arranged into a rectangular grid (map).
is the measure of the assumed number of neighbors,             Every input node is connected to every map node, with a
i.e. it balances the attention between the local and the       certain weight wi j associated to it.
global structure of the dataset. While typical values are      The learning process can be summarized in the following
situated between 1 and 50, the optimal perplexity value        way:
is a function of the dataset size, thus manual tuning is
                                                                   1. Randomly initialize all weights.
necessary.
                                                                   2. Pick one random sample from the dataset.
                                                                   3. For each node of the map, calculate the Euclidean
Isomap Isomap is a nonlinear dimensionality re-                       distance between the input vector and the node’s as-
duction method, belonging to the group of manifold                    sociated weight vector:
learning methods [4]. It assumes that the data cannot
                                                                                                 d0
be well represented in a subspace, but it can by a manifold.                             (i)          (i)
                                                                                    d j (xx ) = ∑ (xk − w jk )2            (22)
                                                                                                k=1
  The steps of the Isomap algorithm are:
                                                                      where d 0 is the dimensionality of the input data, x(i) is
 1. Construct a weighted neighborhood graph using k-                  the input vector and w jk represents the weight asso-
    NN or ε-radius and assign the Euclidean distances                 ciated to the connection between the j-th map node
    between the nodes to the edge weights.                            and the k-th element of the input vector. The node
 2. Compute the shortest path between each node us-                   that produces the smallest distance is called the BMU
    ing the Floyd-Warshall or Dijkstra’s algorithm. Con-              (Best Matching Unit).
    struct a distance matrix D from these pairwise                 4. The weights of the BMU, as well as the weights of its
    geodesic distances.                                               neighboring nodes, are updated. The radius of neigh-
                                                                      bors to be updated is initially large, but it is reduced
 3. Apply the multidimensional scaling (MDS) algo-
                                                                      after each iteration.
    rithm on the distance matrix calculated above. The
                                                                                                            (i)
    following steps are part of the classical MDS algo-                    wt+1    t                           t
                                                                             jk = w jk + η(t)T j,BMU (t)(xk − w jk )       (23)
    rithm. Firstly, double centering is applied to the
                                                                      where η(t) is the decaying learning rate, BMU is the
    squared distance matrix, using the centering matrix:
                                                                      index of the Best Matching Unit and T j,BMU (t) is the
    H = I − n1 eeT , where e is a N · 1 column vector con-
                                                                      neighborhood function of the form:
    taining 1-s.
                               1                                                                                   2
                                                                                 T j,BMU (t) = e−dist( j,BMU)/2σ (t)
                       B = − HD2 H                    (19)                                                                 (24)
                               2
                                                                      If the distance between the two map nodes with in-
 4. Calculate the eigendecomposition of B.                            dices j and BMU is larger, then the output of the
                          B = PΛP−1                     (20)          neighborhood function will be smaller, and the j-th
                                                                      node will not get a significant update, as opposed to
     where P contains the eigenvectors and Λ is a diagonal            a closer node. The σ (t) parameter shrinks with time,
     matrix containing the eigenvalues.                               considering smaller and smaller neighborhoods.
 5. The lower dimensional embedding of the datapoints              5. If t exceeds the maximal iteration number, the algo-
    is given by:                                                      rithm stops.
                                          1
                      Xlow_dim = Pd 0 · Λd20            (21)   5     Evaluation
              1
     where Λ contains the square roots of the d 0 largest
              2
              d0                                               For the evaluation of the class separability on the resulting
     eigenvalues, and Pd 0 is the matrix containing the        2D map, external cluster analysis metrics are used to an-
     eigenvectors corresponding to the first d 0 largest       alyze the correspondence between the ground-truth labels
     eigenvalues.                                              and the results of the clustering algorithm.
5.1   Clustering                                                5.4    Hyperparameters

K-means is used for clustering, as it performed more pos-       A specific hyperparameter-configuration notation is intro-
itively during preliminary testing, as opposed to other         duced for presenting the results, which is formatted in the
methods like hierarchical clustering, DBSCAN or OP-             following way: F-FE-NrBin-Reshape-DimRed-Hyperp.
TICS. The number of centroids was fixed to the number           The meaning of the symbols and the possible values of
of classes, namely 5. The algorithm ran 10 times with dif-      the hyperparameters are described below:
ferent centroid seeds.
                                                                    • F: value between 0 and 5. 0 means no filtering, 1-5
                                                                      encode the cutoff frequency pairs identified in section
5.2   Evaluation metrics                                              4.1.

We concentrated on using external clustering evaluation             • FE: identifier of feature extraction method: STFT,
methods that compare the labels assigned by a clustering              Mel scaled STFT (mel), MFCCs or high-level fea-
algorithm with ground truth labels; similarly to [6], [8].            tures (feat).
Notation: TP - number of pairs of points that are in the
                                                                    • NrBin: number of frequency bins: 13, 20, 40 bins.
same cluster and in the same class, FP - number of pairs of
                                                                      Only valid for the STFT, Mel-STFT, MFCC methods.
points that are in the same cluster but in different classes,
TN - number of pairs of points that are in different clusters       • Reshape: identifier of reshaping method as described
and in different classes, FN - number of pairs of points that         in section 4.3:
are in different clusters but in the same class.
                                                                         1. Flattening.
                                                                         2. Calculating mean and standard deviation across
Adjusted Rand Index
                                                                            frames.
                            TP+TN                                        3. Calculating mean, standard deviation, mini-
               RI =                                     (25)
                      T P + FP + T N + FN                                   mum and maximum values across frames.
                                                                    • DimRed: Dimensionality reduction method, with 4
Fowlkes-Mallows score                                                 possibilities and possible further parameters for some
                  r
                           TP       TP                                methods (Hyperp):
              FM =              ·                       (26)
                        T P + FP T P + FN                                1. PCA - no further hyperparameter was identified,
                                                                            that would significantly influence the outcome.
V-measure It is calculated as the weighted (here β = 1)                  2. t-SNE - identified hyperparameter: Perplexity.
harmonic mean between homogeneity (h) and complete-                         Values to be tested: 20, 40, 60.
ness (c), which are also external cluster analysis metrics.              3. Isomap - identified hyperparameter: Number of
Homogeneity is satisfied when every cluster contains only                   neighbors. Values to be tested: 20, 40, 60.
data points which are members of a single class. Com-
pleteness is satisfied when all points that are members of               4. SOM - identified hyperparameter: Sigma. Val-
the same class, are also the members of the same cluster.                   ues to be tested: 1, 2, 4.

                           (1 + β ) · h · c                        All possible value combinations for the different hyper-
                    Vβ =                                (27)    parameters described above generates 1500 different runs.
                            (β · h) + c
                                                                Gridsearch was executed over the whole set of combina-
   The evaluation of the results with these supervised met-     tions. Each configuration was run 3 times to ease the ef-
rics measures the extent to which audio recordings of the       fects of random components.
same class get projected to points that are close to each
other and form spatially coherent, independent clusters.        6     Results

5.3   Distribution of data                                      Table 1., Table 5., Table 6. show the top 10, the middle
                                                                10 and the worst 10 results according to the mean of the
Five categories of sound samples were used from the Ur-         clustering indices. Similarly, Table 2., Table 3., Table 4
banSound8k dataset: car horn (150 samples), dog bark            show the best 3 results for each dimensionality reduction
(150 samples), street music (142 samples), gunshot (150         method. As one can see processes containing MFCC as
samples), idling engine (150 samples).                          feature extraction method and TSNE as dimensionality
  The recordings of each category contain recordings of         reduction method performed the best and are presented in
varying noise levels, environmental sounds and overlap-         the top results. Not just the top 10 but the best 26 results
ping sounds.                                                    are all variations of this configuration. Although several
                   Figure 2: Best result.                                         Figure 4: Third best result.


                                                                       Configuration             Σ       ARI      FM       V-m
                                                                   3-mfcc-40-2-pca-False       0.404    0.319    0.477    0.417
                                                                   4-mfcc-40-2-pca-False       0.393    0.302    0.466    0.411
                                                                   2-mfcc-40-2-pca-False       0.376    0.286    0.450    0.390

                                                                                 Table 2: Best 3 PCA results

                                                                       Configuration              Σ       ARI      FM       V-m
                                                                   0-mfcc-40-2-isomap-40        0.423    0.332    0.486    0.452
                                                                   0-mfcc-40-2-isomap-60        0.422    0.331    0.484    0.451
                                                                   0-mfcc-40-3-isomap-40        0.412    0.318    0.476    0.441
              Figure 3: Second best result.
                                                                                Table 3: Best 3 Isomap results

researchers didn’t find significant differences between the          Configuration            Σ      ARI       FM       V-m
performance of audio feature extraction methods e.g.:              5-mfcc-40-3-som-1        0.399   0.326     0.470    0.401
[3], our research showed the superiority of MFCC-TSNE              4-mfcc-40-2-som-1        0.399   0.326     0.476    0.395
combination compared to other examined combinations.               4-mfcc-40-2-som-4        0.398   0.339     0.472    0.382
MFCC outperforms the STFT and the high level statistical
feature extraction (which are more general methods for                           Table 4: Best 3 results SOM
not only audio, but other signals as well). TSNE performs
well with every feature extraction method: it occupies the
top 5 results for each feature extraction method in our                 Configuration            Σ       ARI      FM       V-m
experiments.                                                         3-mel-13-2-tsne-60        0.213    0.143    0.317    0.180
                                                                     0-mel-40-2-som-2          0.213    0.126    0.326    0.187
    Configuration           Σ        ARI      FM        V-m        4-mfcc-13-1-pca-False       0.213    0.138    0.320    0.181
 5-mfcc-40-2-tsne-20      0.525     0.466    0.579     0.530          5-feat-6-2-tsne-40       0.213    0.134    0.308    0.196
 4-mfcc-40-2-tsne-20      0.510     0.455    0.568     0.506        4-feat-6-2-isomap-60       0.212    0.115    0.334    0.189
 4-mfcc-40-3-tsne-20      0.496     0.439    0.553     0.497         4-mel-20-3-som-2          0.212    0.132    0.329    0.177
 4-mfcc-40-3-tsne-40      0.494     0.435    0.551     0.498          1-feat-6-3-som-1         0.212    0.138    0.312    0.187
 3-mfcc-40-3-tsne-20      0.492     0.435    0.550     0.491       2-mfcc-13-1-pca-False       0.212    0.137    0.323    0.177
 5-mfcc-40-3-tsne-20      0.488     0.429    0.545     0.488         0-mel-13-2-tsne-20        0.212    0.139    0.315    0.183
 2-mfcc-40-2-tsne-20      0.484     0.426    0.545     0.482        3-stft-20-2-pca-False      0.212    0.074    0.369    0.194
 0-mfcc-40-3-tsne-40      0.480     0.419    0.537     0.485
 1-mfcc-40-3-tsne-20      0.479     0.422    0.540     0.474                      Table 5: Middle 10 results
 0-mfcc-40-2-tsne-40      0.478     0.419    0.537     0.479
   Legend: Σ = Mean of all cluster metrics, ARI = adjusted Rand      According to the results, the V-measure index is the
   index, FM = Fowlkes-Mallows index, V-m. = V-measure            closest to the mean of all albeit it goes without saying that
                                                                  the mean is heavily dependent on the selected indexes. The
       Table 1: Cluster evaluation results: best 10.              mean for the top results is very close to the value of the V-
  The top 3 results are illustrated on Figures 2, 3 and 4.        Measure index. Unfortunately the middle and the worst
results are not correlating to the V-Measure index.             structured in such way. PCA also had poor results in the
                                                                majority of the cases, probably in consequence of assum-
        Configuration        Σ       ARI       FM      V-m      ing that the data lies on a linear subspace, thus being un-
      4-stft-13-1-som-1    0.137    0.060     0.260   0.090     able to discover nonlinear relations.
    5-feat-6-1-isomap-40   0.137    0.050     0.267   0.092
     4-mel-20-1-som-1      0.136    0.037     0.293   0.079     7.1   Future work
      4-feat-6-1-som-1     0.133    0.060     0.252   0.087
                                                                Our future plan is to refine the developed process, elab-
     0-mel-13-1-som-1      0.133    0.047     0.271   0.081
                                                                orate new ones and apply the accumulated knowledge in
      2-feat-6-1-som-1     0.132    0.059     0.254   0.084
                                                                medical and agricultural research, through classification of
      0-feat-6-1-som-1     0.132    0.057     0.254   0.086
                                                                breathing sounds (COVID-19 detection) and bee sounds.
      1-feat-6-1-som-4     0.130    0.057     0.247   0.085
      2-feat-6-1-som-2     0.126    0.054     0.247   0.077
      4-feat-6-1-som-2     0.120    0.046     0.241   0.075     References
                  Table 6: Worst 10 results                     [1] C. Lo, "Nonlinear Dimensionality Reduction for Music Fea-
                                                                    ture Extraction", Tech. Rep. CSC2515, Toronto University,
                                                                    2012
7     Conclusions                                               [2] G. Roma, O. Green, P.A. Tremblay, "Adaptive Mapping of
                                                                    Sound Collections for Data-driven Musical Interfaces", The
As it has been pointed out, the MFCC-TSNE combination               International Conference on New Interfaces for Musical Ex-
achieved the best results. MFCC’s success is supported by           pression, 2019
its widespread use in environmental sound classification        [3] J. Jensen, M. Christensen, M. Murthi, S. Jensen, "Evalua-
[13] and it is also recommended and used by the authors of          tion of MFCC estimation techniques for music similarity",
the dataset [5]. Apart from this observation, other conclu-         Proceedings of the EUSIPC, 2006
sions can be drawn. By examining the impact of filtering,       [4] J. B. Tenenbaum, V. de Silva, J. C. Langford, "A Global
it is obvious that none of the five filtering variants were         Geometric Framework for Nonlinear Dimensionality Reduc-
                                                                    tion", Science 290 (5500), pp 2319-2323, 2000
able to improve the results. Mel-scaled STFT and high-
level features produced inefficient data representations and    [5] J. Salamon, C. Jacoby, J. P. Bello, "A Dataset and Taxon-
                                                                    omy for Urban Sound Research", 22nd ACM International
received weak evaluation scores. The poor performance
                                                                    Conference on Multimedia, Orlando USA, 2014.
of the Mel-scale could be attributed to it being optimized
                                                                [6] K. Kim et al., “Clustering Music by Genres Using Super-
for speech. Using just 6 high-level descriptors is probably
                                                                    vised and Unsupervised Algorithms”, 2015
insufficient to represent the audio data well. The perfor-
                                                                [7] L. Fedden, "Comparative Audio Analysis With Wavenet,
mance of these feature extraction methods is dependant
                                                                    MFCCs, UMAP, t-SNE and PCA", https://medium.c
on the nature of the dataset, some methods may perform              om/@LeonFedden/comparative-audio-analysis-wit
better for some specific type of sounds. Considering the            h-wavenet-mfccs-umap-t-sne-and-pca-cb8237bfc
optimal number of frequency bins, higher values achieved            e2f, 2017
better results, i.e. using 40 frequency bins proved to be the   [8] L. Hantrakul, A. Sarwate, "klustr: a Tool for Dimensionality
best choice. This is understandable, as more dimensions             Reduction and Visualization of Large Audio Datasets", ht
can hold more data, but it is likely that based on the curse        tps://github.com/lamtharnhantrakul/klustr/bl
of dimensionality, there is an upper limit at which perfor-         ob/master/paper/klustr_final.pdf, 2017
mance starts to degrade. As far as the optimal reshaping        [9] L. van der Maaten, G. Hinton, "Visualizing Data using t-
method is concerned, both the second and third methods              SNE", Journal of Machine Learning Research 9, pp 2579-
attained top 5 positions, while the first method (flattening)       2605, 2008
had a significantly weaker performance, as it is included       [10] P. Mermelstein, "Distance Measures for Speech Recogni-
in every list of worst performances. Yet again, the poor            tion, Psychological and Instrumental", Pattern Recognition
performance of the flattening method is a consequence of            and Artificial Intelligence, pp. 374–388., 1976
the curse of dimensionality phenomenon.                         [11] S. Dupont et al., "Nonlinear Dimensionality Reduction Ap-
   Considering dimensionality reduction methods, t-SNE              proaches Applied to Music and Textural Sounds", IEEE In-
clearly emerged as a winner, both in a visual and metri-            ternational Conference on Multimedia and Expo (ICME),
cal sense. TSNE is famous for generally producing well              2013
separated, good visualizations, as pointed out by its au-       [12] T. Kohonen, "Self-Organized Formation of Topologically
thors [9]. While t-SNE achieved the best results according          Correct Feature Maps", Biological Cybernetics 43, p. 59-69,
to the evaluation metrics, SOM also succeeded in produc-            1982
ing well separable, good quality clusterings. For Isomap,       [13] Z. Shuyang et al., "Active Learning for Sound Event
many resulting plots were elongated and inhomogeneous.              Classification by Clustering Unlabeled Data", International
                                                                    Conference on Acoustics, Speech and Signal Processing
This may be explained by the fact that Isomap expects the
                                                                    (ICASSP), 2017
data to lie on a manifold, but in actuality the data is not