Music Genre Recognition*
                                Adam Grelewicz1,∗,†, Mateusz Lis1,† and Dawid Michalak1,∗,†
                                1
                                 Faculty of Applied Mathematics, Silesian University of Technology, Kaszubska 23, 44-100 Gliwice, POLAND


                                                 Abstract
                                                 Sound analysis, combined with artificial intelligence models, plays a crucial role in identifying
                                                 various types of defects. Designing a well-functioning model requires thorough data analysis.
                                                 Therefore, this article presents an implementation of the MFCC algorithm applied to different music genres.
                                                 The algorithm uses a high-pass filter and triangular filters, and the recording is transformed using the
                                                 discrete Fourier transform (DFT). The correctness of the algorithm is then verified by checking whether the
                                                 KNN and Naive Bayes classifiers identify the music genre correctly. The project was
                                                 conducted on a publicly available dataset. On the test set, the KNN classifier reached an accuracy of 80.56%,
                                                 compared with 36.67% for Naive Bayes, which shows the advantage of KNN over Naive Bayes in this sound analysis task.
                                                 Keywords
                                                 MFCC algorithm, Genre Recognition, Naive Bayes, KNN


                                1. Introduction
                                Sound is a wave that arises from changes in atmospheric pressure caused by vibration [1].
                                Combined with artificial intelligence systems, it can have broad applications in various fields. In
                                medicine, image recognition using deep learning is utilized. In [2], models are used to help
                                specialists diagnose diseases more quickly. It is worth noting that sound contains a lot of
                                information. Based on sound, certain abnormalities can be detected. In [3], the use of heart
                                sounds for early disease detection is excellently demonstrated, allowing for earlier treatment. In article
                                [12], there is another medical application, namely recognizing people with Parkinson’s disease
                                from recorded voice samples. The average accuracy of this method is around 90%.
                                   Sound is therefore often converted into a spectrogram, which is then passed to an image
                                recognition model. A spectrogram is a visual representation of the intensity of a signal over
                                time with respect to the different frequencies present in a given waveform. Computing a
                                spectrogram involves transforming the signal from the time domain to the frequency domain using
                                the Fourier transform [4]. In [4], it is shown that sound can also be used in the food industry
                                to identify various food products. The articles [8][9][10][11] describe various techniques
                                utilizing sound recognition, such as Environmental Sound Recognition (ESR) and
                                Automatic Sound Recognition (ASR), which can be used in a smart home. A smart home, along
                                with artificial intelligence methods, can provide support for people, reduce operating costs, and
                                improve energy efficiency [13][14]. This field therefore also utilizes sound recognition, which
                                can serve as one of the biometric security measures for homes [15]. In all of these applications,
                                however, the sound processing scheme is the same.
                                   All these articles demonstrate that data analysis is very important for the application of neural
                                networks. In particular, sound must be properly processed. Sound, especially human speech or
                                music, has certain features that can be used to characterize it, such as a unique human
                                voice, noise specific to the communication method, or the use of similar instruments in musical
                                pieces of the same genre [1]. Therefore, to extract the most important features of a sound signal in the
                                form of a coefficient matrix, the MFCC algorithm is used; it is described in detail in this
                                article. Later in this article, two classifiers are compared: KNN and Naive Bayes.


                                *IVUS2024: Information Society and University Studies 2024, May 17, Kaunas, Lithuania
                                ∗ Corresponding author.
                                † These authors contributed equally.
                                CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
                                       ml307892@student.polsl.pl (M.Lis); dm307899@student.polsl.pl (D. Michalak); ag307868@student.polsl.pl (A. Grelewicz)
                                               ©️ 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2. Methodology
2.1. The MFCC Algorithm (Mel-frequency cepstral coefficients)
Before describing the MFCC algorithm itself, certain concepts need to be defined:
   1. Mel scale - a scale of pitches that measures the perceived frequency of sound, in contrast to
      the objective frequency scale measured in hertz.
      The function for converting a frequency $f$ in hertz to the Mel scale (here $\log$ denotes the
      natural logarithm, consistent with the inverse function below):
      $$m(f) = 1125 \log\left(1 + \frac{f}{700}\right) \tag{1}$$
      The inverse function:
      $$f(m) = 700\left(\exp\left(\frac{m}{1125}\right) - 1\right) \tag{2}$$
   2. Window function - a function that takes non-zero values only within a specified interval.
      Such functions are used to filter signals.


      Figure 1: An example of a window function
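The two Mel-scale conversions in equations (1) and (2) can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function names `hz_to_mel` and `mel_to_hz` are chosen here for readability and do not come from the original implementation.

```python
import math


def hz_to_mel(f):
    """Equation (1): frequency in hertz -> Mel scale (natural logarithm)."""
    return 1125.0 * math.log(1.0 + f / 700.0)


def mel_to_hz(m):
    """Equation (2): Mel scale -> frequency in hertz."""
    return 700.0 * (math.exp(m / 1125.0) - 1.0)


# Round trip for a few frequencies, including the Nyquist frequency of a 22050 Hz file.
for f in (0.0, 440.0, 11025.0):
    m = hz_to_mel(f)
    print(f, round(m, 2), round(mel_to_hz(m), 2))
```

Because the two functions are exact inverses of each other, converting to the Mel scale and back should return the original frequency up to floating-point rounding.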


  For the purposes of mathematical description, let’s introduce the following notation:
1. {...}* denotes an array, which is a set where elements can repeat and maintain order. An
   array of arrays is called a matrix.
2. If 𝐴 is an array, the notation 𝐴[𝑘] means the k-th element of 𝐴.
3. If 𝐴 is an array, the notation 𝐴[: 𝑘] means the first k elements of 𝐴.
4. All other operations on arrays work similarly to operations on sets.

Description of the MFCC algorithm:
1. Let:
      • 𝑛 be the number of samples in the input signal,
      • 𝑋 = {𝑥0, 𝑥1, ..., 𝑥𝑛−1}* be the input signal,
      • ℎ be the sampling frequency of the input signal in hertz,
      • 𝑡 be the number of triangular filters,
      • 𝑢 be the number of samples transformed by the discrete Fourier transform,
      • 𝑙 be the length of the window in samples,
      • 𝑠 be the number of samples by which the window is shifted and 0 < 𝑠 ≤ 𝑙,
      • 𝑐 be the number of cepstral coefficients.
2. Filters are applied to remove noise. This step is optional, but noise removal improves
   accuracy, so a high-pass filter is used in the form of:

   $$y_k = x_k - 0.97\,x_{k+1}, \qquad k \in \{0, 1, 2, \ldots, n-2\}^* \tag{3}$$

3. Triangular filters are created to extract desired features from the input signal while
   omitting unnecessary ones. These filters are distributed on a frequency scale between 0 and
   $h/2$. Initially, the boundaries of this scale are converted from frequency in hertz to the Mel
   scale.
   $$\begin{cases} a = m(0) = 0 \\ b = m\left(\frac{h}{2}\right) = 1125 \log\left(1 + \frac{h}{1400}\right) \end{cases} \tag{4}$$
   An array of 𝑡 + 2 numbers is created, evenly distributed between 𝑎 and 𝑏.

   $$U = \{a + k\,\Delta x : k \in \{0, 1, 2, \ldots, t+1\}^*\}^*, \qquad \Delta x = \frac{b-a}{t+1} \tag{5}$$
   Another array is created containing the scaled elements of array 𝑈 , converted from the Mel
   scale to the frequency scale in hertz and rounded down.
   $$B = \left\{\left\lfloor \frac{u+1}{h}\, f(x) \right\rfloor : x \in U\right\}^* \tag{6}$$

   Array $B$ contains non-linearly distributed numbers from 0 to $\lfloor u/2 \rfloor$.
   Another array is created to contain the triangular filters mentioned at the beginning.
   $$F = \left\{\left\{g(k, j) : j \in \{0, 1, 2, \ldots, \lfloor u/2 \rfloor\}^*\right\}^* : k \in \{0, 1, 2, \ldots, t-1\}^*\right\}^* \tag{7}$$
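   where:
   $$g(k, j) = \begin{cases}
   \dfrac{j - B[k]}{B[k+1] - B[k]}, & \lfloor B[k] \rfloor \le j \wedge j < \lfloor B[k+1] \rfloor \\[2ex]
   \dfrac{B[k+2] - j}{B[k+2] - B[k+1]}, & \lfloor B[k+1] \rfloor \le j \wedge j < \lfloor B[k+2] \rfloor \\[2ex]
   0, & \text{for other } j
   \end{cases}$$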
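4. The input signal is divided into windows, where a window is defined as:
   $$W(i) = \{p(k) : si \le k < si + l\}^*, \qquad i \in \left\{0, 1, 2, \ldots, \left\lfloor \frac{n + s - n \bmod s}{s} \right\rfloor - 1\right\}^* \tag{8}$$
   where:
   $$p(k) = \begin{cases} x_k, & 0 \le k < n \\ 0, & n \le k \end{cases}$$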
  𝑖 is the window index.
  The power spectrum is calculated: the discrete Fourier transform (DFT) of the first $u$
  elements of the array $W(i)$ is computed, the squared absolute value of each number in the
  resulting array is taken, and the result is scaled by $1/l$.
  $$P(i) = \frac{1}{l}\, S(\mathrm{DFT}(W(i), u)) \tag{9}$$
  where $P(i)$ denotes the i-th power spectrum for the i-th window $W(i)$ and
  $$S(\{s_0, s_1, \ldots, s_n\}^*) = \{|s_0|^2, |s_1|^2, \ldots, |s_n|^2\}^*$$
  Absolute values are required in function 𝑆 as 𝑠𝑘 can be complex numbers.
  The previously calculated filters are then utilized to filter the power spectrum via the
  matrix product of $P(i)$ and the transpose of matrix $F$.
  $$C(i) = P(i)\,F^{\mathsf{T}} \tag{10}$$
  The final step is to compute the natural logarithm for each element of $C(i)$ and transform these
  logarithms using the discrete cosine transform (DCT) of type II.
  $$R(i) = \mathrm{DCT}(L(C(i)))[:c] \tag{11}$$
  where $L(\{x_0, x_1, \ldots, x_n\}^*) = \{\ln x_0, \ln x_1, \ldots, \ln x_n\}^*$.


  The result of the algorithm is a matrix:
  $$R = \left\{R(0), R(1), R(2), \ldots, R\left(\left\lfloor \frac{n + s - n \bmod s}{s} \right\rfloor - 1\right)\right\}^* \tag{12}$$
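Step 3 of the description above (equations (4) to (7) and the definition of $g(k, j)$) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, only the standard library is used, and the conditions of $g$ are evaluated on the already floored entries of $B$.

```python
import math


def hz_to_mel(f):
    return 1125.0 * math.log(1.0 + f / 700.0)      # equation (1)


def mel_to_hz(m):
    return 700.0 * (math.exp(m / 1125.0) - 1.0)    # equation (2)


def filter_edges(h, t, u):
    """Array B of equation (6): t + 2 bin indices spaced evenly on the Mel scale."""
    a, b = hz_to_mel(0.0), hz_to_mel(h / 2.0)      # equation (4)
    dx = (b - a) / (t + 1)                         # equation (5)
    return [math.floor((u + 1) / h * mel_to_hz(a + k * dx)) for k in range(t + 2)]


def g(B, k, j):
    """Value of the k-th triangular filter at DFT bin j."""
    if B[k] <= j < B[k + 1]:                       # rising edge
        return (j - B[k]) / (B[k + 1] - B[k])
    if B[k + 1] <= j < B[k + 2]:                   # falling edge
        return (B[k + 2] - j) / (B[k + 2] - B[k + 1])
    return 0.0


def filterbank(h, t, u):
    """Matrix F of equation (7): t rows, floor(u / 2) + 1 columns."""
    B = filter_edges(h, t, u)
    return [[g(B, k, j) for j in range(u // 2 + 1)] for k in range(t)]


# Parameters taken from the Experiments section: h = 22050, t = 26, u = 512.
B = filter_edges(22050, 26, 512)
F = filterbank(22050, 26, 512)
print(B)
print(len(F), len(F[0]), F[0][:4])
```

With the parameters from the Experiments section, this sketch yields 26 filters of 257 values each, which can be compared against the array $B$ and the first filter worked out there.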
2.2. KNN (k-Nearest-Neighbours)
The K-Nearest Neighbors (KNN) algorithm is a classification and regression method that utilizes the
similarity between data points. It operates by finding the nearest neighbors (data points) to a new
point and uses their information to predict the class or value for that point [5]. Before
describing the KNN algorithm itself, certain concepts need to be defined:
   Value of k - the number of neighbors to be considered during classification or regression.
   Mahalanobis distance - a distance between two vectors $x$ and $y$ that takes the correlations
between features into account through the covariance matrix $S$ and scales distances according to
the distribution of the data. It is given by the formula:
   $$D_M(x, y) = \sqrt{(x - y)^{\top} S^{-1} (x - y)} \tag{13}$$
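The distance in equation (13) and the nearest-neighbor vote can be sketched as follows. This is a minimal standard-library illustration, not the implementation used in the experiments: the helper names `solve`, `mahalanobis` and `knn_predict` are hypothetical, the covariance matrix $S$ is passed in as an argument because the text does not state how it is estimated, and $S$ is assumed to be invertible.

```python
import math
from collections import Counter


def solve(S, v):
    """Solve S z = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [list(map(float, S[r])) + [float(v[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(c + 1, n):
            m = A[r][c] / A[c][c]
            for q in range(c, n + 1):
                A[r][q] -= m * A[c][q]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (A[r][n] - sum(A[r][q] * z[q] for q in range(r + 1, n))) / A[r][r]
    return z


def mahalanobis(x, y, S):
    """Equation (13): sqrt((x - y)^T S^-1 (x - y))."""
    d = [xi - yi for xi, yi in zip(x, y)]
    z = solve(S, d)  # z = S^-1 (x - y), without forming the inverse explicitly
    return math.sqrt(sum(di * zi for di, zi in zip(d, z)))


def knn_predict(train, query, S, k=5):
    """Majority vote among the k training points closest to `query`.

    `train` is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: mahalanobis(item[0], query, S))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical two-dimensional toy data, for illustration only.
train = [([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
         ([5, 5], "b"), ([5, 6], "b"), ([6, 5], "b")]
identity = [[1, 0], [0, 1]]
print(knn_predict(train, [0.2, 0.1], identity, k=3))
```

With the identity matrix as $S$, the Mahalanobis distance reduces to the Euclidean distance, which makes the toy example easy to check by hand.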

2.3. Naive Bayes classifier
The Naive Bayes classifier is a machine learning method used for classifying data into decision
classes. Despite its simplicity, it has a wide range of applications in text classification,
medical diagnosis, and system performance management. The task of the Bayes classifier is to
assign a new case to one of the classes [6][7]. Each training example is described by a set of
conditional attributes $\{X_i\}$ and one decision attribute $D$. According to Bayes' theorem, the
most probable class to which a new object, described by the values of $n$ conditional attributes
$\langle x_{j1}, x_{j2}, \ldots, x_{jn} \rangle$, belongs is the class $d_i$ that maximizes the
conditional probability $P(d_i \mid x_{j1}, x_{j2}, \ldots, x_{jn})$.
$$d = \arg\max_{d_i \in V_D} P(d_i) \cdot P(x_{j1}, x_{j2}, \ldots, x_{jn} \mid d_i) \tag{14}$$

The probability $P(d_i)$ can be estimated as the ratio of the number of training examples belonging to
class $d_i$ to the total number of training examples. To estimate $P(x_{j1}, x_{j2}, \ldots, x_{jn} \mid d_i)$,
the Naive Bayes classifier assumes the conditional independence of attributes:

$$P(x_{j1}, x_{j2}, \ldots, x_{jn} \mid d_i) = \prod_{k=1}^{n} P(x_{jk} \mid d_i) \tag{15}$$

The probability $P(x_{jk} \mid d_i)$ can be estimated as the ratio of the number of training examples in class $d_i$ for
which the attribute $X_k$ has the value $x_{jk}$ to the total number of training examples in class $d_i$.
Considering this assumption, the class $d_{NB}$ (Naive Bayes) chosen for a new example is:
$$d_{NB} = \arg\max_{d_i \in V_D} P(d_i) \cdot \prod_{k=1}^{n} P(x_{jk} \mid d_i) \tag{16}$$
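The frequency-based estimates and the decision rule of equation (16) can be sketched as follows. This is a minimal illustration for discrete attribute values only; the function names and the toy data are hypothetical, and continuous features such as MFCC values would first need to be discretized or modeled with a density, which this sketch does not cover.

```python
from collections import Counter, defaultdict


def fit(examples):
    """Count class frequencies and per-class attribute-value frequencies.

    `examples` is a list of (attribute_values, decision_class) pairs.
    """
    class_counts = Counter(d for _, d in examples)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for values, d in examples:
        for k, v in enumerate(values):
            value_counts[(d, k)][v] += 1
    return class_counts, value_counts


def predict(model, values):
    """Equation (16): argmax over classes of P(d_i) * prod_k P(x_jk | d_i)."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for d, n_d in class_counts.items():
        score = n_d / total                          # estimate of P(d_i)
        for k, v in enumerate(values):
            score *= value_counts[(d, k)][v] / n_d   # estimate of P(x_jk | d_i)
        if score > best_score:
            best_class, best_score = d, score
    return best_class


# Hypothetical toy data with two discrete attributes, for illustration only.
examples = [
    (("loud", "fast"), "metal"), (("loud", "fast"), "metal"), (("loud", "slow"), "metal"),
    (("quiet", "slow"), "classical"), (("quiet", "slow"), "classical"),
    (("quiet", "fast"), "classical"),
]
model = fit(examples)
print(predict(model, ("loud", "fast")), predict(model, ("quiet", "slow")))
```

Note that a value never seen in a class drives the whole product to zero; smoothing the counts is a common remedy, but it is not part of the formulas above.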

3. Experiments
The MFCC algorithm is illustrated using the file classical.00000.wav.
   1. The sampling frequency for this file is $h = 22050$ Hz.
      The length of the file is 30.013 s.
      The number of samples is $n = 661794$.
      The sound samples are $X = \{x_0, x_1, \ldots, x_{n-1}\}^*$.
      The samples of this file are plotted in Figure 2.


  Figure 2: Representation of samples in audio.


      The number of triangular filters is $t = 26$.
      The number of values transformed by the discrete Fourier transform is $u = 512$.
      The length of the window in samples is $l = 551$.
      The number of samples by which the window is shifted is $s = 220$.
      The number of cepstral coefficients is $c = 13$.
   2. After applying the high-pass filter, the samples look as shown in Figure 3.


  Figure 3: The samples after high-pass filtering.


  The difference between the samples before and after the filter is not visible at first glance,
      but applying this filter to each file improved accuracy by about 4%.
   3. Triangular filters are created. According to the formulas in the algorithm description, the
      boundary frequencies are obtained.
      $$\begin{cases} a = m(0) = 0 \\ b = m\left(\frac{22050}{2}\right) = 1125 \log\left(1 + \frac{22050}{1400}\right) = 1125 \log\frac{67}{4} \end{cases}$$

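      An array of $t + 2$ numbers evenly distributed between $a$ and $b$ is created.
      $$\begin{aligned}
      \Delta x &= \frac{1125 \log\frac{67}{4}}{27} = \frac{125}{3} \log\frac{67}{4} \\
      U &= \{a + k\,\Delta x : k \in \{0, 1, 2, \ldots, 27\}^*\}^* \\
      \Rightarrow U &= \left\{\frac{125k}{3} \log\frac{67}{4} : k \in \{0, 1, 2, \ldots, 27\}^*\right\}^* \\
      \Rightarrow U &= \left\{0, \frac{125}{3} \log\frac{67}{4}, \frac{250}{3} \log\frac{67}{4}, \ldots, \frac{3375}{3} \log\frac{67}{4}\right\}^*
      \end{aligned}$$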
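      The array $B$ is created according to the formula in the description.
      $$\begin{aligned}
      B &= \left\{\left\lfloor \frac{513}{22050} f(x) \right\rfloor : x \in U\right\}^* \\
      \Rightarrow B &= \left\{\left\lfloor \frac{513}{22050} f(0) \right\rfloor, \left\lfloor \frac{513}{22050} f\left(\frac{125}{3} \log\frac{67}{4}\right) \right\rfloor, \ldots, \left\lfloor \frac{513}{22050} f\left(\frac{3375}{3} \log\frac{67}{4}\right) \right\rfloor\right\}^* \\
      \Rightarrow B &= \left\{\lfloor 0 \rfloor, \left\lfloor \frac{5130}{315}\left(\sqrt[27]{\frac{67}{4}} - 1\right) \right\rfloor, \ldots, \lfloor 256.5 \rfloor\right\}^* \\
      \Rightarrow B &= \{0, 1, \ldots, 256\}^*
      \end{aligned}$$
      The array $B$ looks as follows:
      $$B = \{0, 1, 3, 5, 8, 11, 14, 17, 21, 25, 29, 35, 40, 46, 53, 61, 70, 79, 90, 102, 115, 129, 145, 163, 183, 205, 229, 256\}^*$$
      As can be seen, the array $B$ contains numbers ranging from 0 to $\lfloor u/2 \rfloor = 256$.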
      These numbers are not evenly distributed: the differences between consecutive numbers
      increase as the elements progress. This is because the boundary frequencies $a$ and $b$ were
      converted from the frequency scale in hertz to the Mel scale, which is nonlinear. The reason
      why a change to a nonlinear scale was required is explained later in the example.
      Triangular filters are created, again according to the formula from the description:

      $$F = \{\{g(k, j) : j \in \{0, 1, 2, \ldots, 256\}^*\}^* : k \in \{0, 1, 2, \ldots, 25\}^*\}^*$$

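      For $k = 0$ the function $g(0, j)$ will look like:
      $$g(0, j) = \begin{cases}
      \dfrac{j - B[0]}{B[1] - B[0]}, & \lfloor B[0] \rfloor \le j < \lfloor B[1] \rfloor \\[2ex]
      \dfrac{B[2] - j}{B[2] - B[1]}, & \lfloor B[1] \rfloor \le j < \lfloor B[2] \rfloor \\[2ex]
      0, & \text{for other } j
      \end{cases}
      \;\Rightarrow\;
      g(0, j) = \begin{cases}
      j, & 0 \le j < 1 \\[1ex]
      \dfrac{3 - j}{2}, & 1 \le j < 3 \\[1ex]
      0, & \text{for other } j
      \end{cases}$$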
      The array $\{g(0, j) : j \in \{0, 1, 2, \ldots, 256\}^*\}^*$ (the first element of $F$) looks like:
      $$\{g(0, j) : j \in \{0, 1, 2, \ldots, 256\}^*\}^* = \left\{0, 1, \frac{1}{2}, 0, \ldots, 0\right\}^*$$
      This filter is plotted in Figure 4.


    Figure 4: Representation of a triangular filter for that sample


      As can be seen, this filter is very narrow. It passes only the lowest frequency bins and zeroes
    out the rest.
      After calculating all the values $g(k, j)$, all $t = 26$ filters can be plotted together, as shown in Figure 5.
      Because the Mel scale was used to distribute these filters, their density is highest at the
    beginning and lowest at the end.
Figure 5: Representation of the triangular filterbank


  The reason for using the Mel scale is that it reflects how humans perceive
sound. It turns out that most useful information is in the lower frequencies, not the higher
ones. Therefore, it makes sense to place more filters at the beginning, which was achieved
by converting the frequency scale in hertz to the Mel scale. Without this, all filters would be
evenly distributed across the entire scale.
  When evenly distributed filters were tested instead, accuracy decreased by about 5%.
  The input signal is divided into windows.

$$W(i) = \{p(k) : 220i \le k < 220i + 551\}^*, \qquad i \in \{0, 1, \ldots, 3008\}^*$$
$$p(k) = \begin{cases} x_k, & 0 \le k < 661794 \\ 0, & 661794 \le k \end{cases}$$

For the window 𝑖 = 0:
                         𝑊 (0) = {𝑝(𝑘) : 0 ≤ 𝑘 < 551}*
                       ⇒ 𝑊 (0) = {𝑥0, 𝑥1, 𝑥2, ..., 𝑥550}*
                       ⇒ 𝑊 (0) = {−102.19, −705.89, −136.54, ..., 123.9}*

For the window 𝑖 = 1:
                                   𝑊 (1) = {𝑝(𝑘) : 220 ≤ 𝑘 < 771}*
                                ⇒ 𝑊 (1) = {𝑥220, 𝑥221, 𝑥222, ..., 𝑥770}*

For the window 𝑖 = 3008 (the last window):

                              𝑊 (3008) = {𝑝(𝑘) : 661760 ≤ 𝑘 < 662311}*
                              ⇒ 𝑊 (3008) = {𝑥661760, 𝑥661761, 𝑥661762, ...,
                              𝑥661793, 0, 0, ..., 0}*
It is worth noting that for 𝑖 = 3008 the index 𝑘 goes beyond the input signal, so there are zeros at
the end.
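The high-pass filter of equation (3) and the windowing described above can be sketched as follows. This is a minimal standard-library illustration with hypothetical function names. The filter is written exactly as in equation (3), using the following sample $x_{k+1}$; the more common pre-emphasis form uses the preceding sample instead. Whether the windows are cut from the raw or the filtered samples is left to the caller.

```python
def pre_emphasis(x, alpha=0.97):
    """Equation (3) as written: y_k = x_k - alpha * x_{k+1} for k = 0 .. n - 2."""
    return [x[k] - alpha * x[k + 1] for k in range(len(x) - 1)]


def num_windows(n, s):
    """Number of windows from equation (8): floor((n + s - n mod s) / s)."""
    return (n + s - n % s) // s


def window(x, i, l, s):
    """W(i): samples s*i .. s*i + l - 1, with zeros past the end of the signal."""
    n = len(x)
    return [x[k] if k < n else 0.0 for k in range(s * i, s * i + l)]


# Parameters of the worked example: n = 661794 samples, shift s = 220, length l = 551.
n, s, l = 661794, 220, 551
count = num_windows(n, s)
last_start = s * (count - 1)
padding = last_start + l - n   # number of zeros appended to the last window
print(count, last_start, padding)
```

For the parameters of the worked example, the last window index is one less than the window count, and only the last 34 samples of the signal fall into that window before the zero padding begins.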
  Now the power spectral density is calculated:
$$P(i) = \frac{1}{551}\, S(\mathrm{DFT}(W(i), 512))$$
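For the window $i = 0$:

$$\begin{aligned}
P(0) &= \frac{1}{551}\, S(\mathrm{DFT}(W(0), 512)) \\
\Rightarrow P(0) &= \frac{1}{551}\, S(\mathrm{DFT}(\{x_0, x_1, x_2, \ldots, x_{550}\}^*, 512)) \\
\Rightarrow P(0) &= \{476.280941, 6620.61200, 8149.37465, \ldots, 195.371343\}^*
\end{aligned}$$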
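The power spectrum of a single window can be sketched as follows. This is an illustrative standard-library version, not the authors' code. It assumes that only the non-negative frequency bins $0, \ldots, \lfloor u/2 \rfloor$ are kept, so that $P(i)$ has the same number of entries as each triangular filter; the text does not state this explicitly. A direct $O(u^2)$ DFT is used for clarity; a practical implementation would use a fast Fourier transform.

```python
import cmath


def dft(w, u):
    """One-sided DFT of the first u samples of w: bins 0 .. floor(u / 2)."""
    w = list(w[:u]) + [0.0] * max(0, u - len(w))  # zero-pad if the window is shorter than u
    return [sum(w[m] * cmath.exp(-2j * cmath.pi * j * m / u) for m in range(u))
            for j in range(u // 2 + 1)]


def power_spectrum(w, u, l):
    """Equation (9): P = (1 / l) * |DFT(w, u)|^2, element by element."""
    return [abs(z) ** 2 / l for z in dft(w, u)]


# Hypothetical toy window: a constant signal has all of its power in bin 0.
P = power_spectrum([1.0] * 8, u=8, l=8)
print(len(P), [round(p, 6) for p in P])
```

With $u = 512$ this sketch returns 257 values per window, matching the 257 columns of the filter matrix $F$.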

Having the power spectral density, the matrix product of $P(i)$ and $F^{\mathsf{T}}$ is calculated, which
filters the frequencies according to the triangular filters established previously.

$$C(i) = P(i)\,F^{\mathsf{T}} \tag{17}$$

For $i = 0$ we get:

$$\begin{aligned}
C(0) &= P(0)\,F^{\mathsf{T}} \\
\Rightarrow C(0) &= \{10695.2993, 31658.4727, 20555.0554, \ldots, 18245.6147\}^*
\end{aligned}$$
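The product in equation (17) amounts to one dot product per triangular filter, which can be sketched as follows. The function name and the small numbers are hypothetical and serve only to show the shapes involved: a power spectrum with as many entries as a filter has columns, and one output value per filter.

```python
def apply_filterbank(P, F):
    """Equation (17): C = P F^T, i.e. one dot product per row of F."""
    if any(len(row) != len(P) for row in F):
        raise ValueError("each filter must have as many entries as the power spectrum")
    return [sum(p * f for p, f in zip(P, row)) for row in F]


# Hypothetical toy values: a 4-bin spectrum and two overlapping triangular filters.
P = [1.0, 2.0, 3.0, 4.0]
F = [[0.0, 1.0, 0.5, 0.0],
     [0.0, 0.0, 0.5, 1.0]]
print(apply_filterbank(P, F))
```

In the worked example, $P(0)$ has 257 entries and $F$ has 26 rows, so $C(0)$ has 26 entries.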

Finally, the discrete cosine transform of the logarithms of 𝐶(𝑖) is calculated, taking only the first 𝑐
= 13 elements:
                                      𝑅(𝑖) = DCT(𝐿(𝐶(𝑖)))[: 13]
For $i = 0$ we get:
$$\begin{aligned}
R(0) &= \mathrm{DCT}(L(C(0)))[:13] \\
\Rightarrow R(0) &= \mathrm{DCT}(L(\{10695.2993, 31658.4727, 20555.0554, \ldots, 18245.6147\}^*))[:13] \\
\Rightarrow R(0) &= \{62.5650537, -2.03586229, -5.32321543, \ldots, -1.89111160\}^*
\end{aligned}$$
Finally, the array 𝑅 is obtained, which is:

                                𝑅 = {𝑅(0), 𝑅(1), 𝑅(2), ..., 𝑅(3008)}*

Thus, the input signal is represented in the form of a matrix of cepstral coefficients.
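The last step, the logarithm followed by a type II discrete cosine transform, can be sketched as follows. This is an illustrative standard-library version with hypothetical function names. It uses the unnormalised DCT-II definition; scaling conventions differ between libraries, and the text does not state which one produced the values above, so the numbers from this sketch may differ from them by a constant factor. It also assumes that all filter outputs are strictly positive, since the logarithm is undefined otherwise.

```python
import math


def dct2(x):
    """Unnormalised DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 1/2) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]


def cepstral_coefficients(C, c):
    """Equation (11): first c DCT coefficients of the natural logarithms of C."""
    return dct2([math.log(v) for v in C])[:c]


# Hypothetical toy input: four identical filter outputs give a single non-zero coefficient.
R = cepstral_coefficients([math.e] * 4, c=2)
print([round(v, 6) for v in R])
```

Keeping only the first $c$ coefficients retains the slowly varying part of the log spectrum, which is why the result for each window is much shorter than the filter output it is computed from.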
  Analysis of the results is conducted for 6 classes, corresponding to the following music
genres:
    • Classical music,
    • Disco,
    • Hip-hop,
    • Metal,
    • Blues,
    • Country.

   For each genre, there are 100 assigned tracks, each lasting 30 seconds. The split between
training and test data is 70:30.
   Before conducting a detailed analysis, it is important to determine the most effective value of k
for the KNN classifier. According to Table 1, the most effective value is k = 5;
therefore, this value is adopted for the analysis.

                                            𝑘     Accuracy
                                           3       77.78%
                                           4       78.89%
                                           5       80.56%
                                           6       77.78%
                                           7       77.22%
                                           8       77.78%
                                           9       76.67%
                                           10      75.56%
Table 1
Accuracy of the KNN algorithm depending on the number of neighbors.
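The choice of k from Table 1 can be reproduced with a few lines of Python. The accuracy values are copied from the table; the test-set size is an assumption derived from the text (6 genres with 100 tracks each and a 70:30 split), and the counts of correctly classified tracks are inferred from the percentages rather than reported by the authors.

```python
# Accuracy values copied from Table 1 (percent).
table_1 = {3: 77.78, 4: 78.89, 5: 80.56, 6: 77.78,
           7: 77.22, 8: 77.78, 9: 76.67, 10: 75.56}

best_k = max(table_1, key=table_1.get)

# Assumption: 6 genres x 100 tracks with a 70:30 split gives 180 test tracks.
test_size = round(6 * 100 * 0.30)

# Inferred number of correctly classified test tracks for each k.
correct = {k: round(acc / 100 * test_size) for k, acc in table_1.items()}

print(best_k, test_size, correct[best_k])
```

Under this assumption every accuracy in Table 1 corresponds to a whole number of test tracks, and neighbouring values of k differ by only a handful of tracks, so the preference for k = 5 rests on a small margin.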

  The next step is to evaluate the obtained matrices with the KNN and Naive Bayes classifiers.
Performance evaluation metrics such as accuracy, loss, precision, recall, and F1 score are
used to assess the effectiveness of these methods [4]. These metrics are essential for evaluating the
performance of machine learning models and are described by the following equations:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{18}$$
$$\text{Loss} = \frac{FP + FN}{TP + TN + FP + FN} \tag{19}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{20}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{21}$$
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{22}$$
  where:
    • TP (True Positive), which is the number of cases where the model correctly classified
      positive instances.
    • TN (True Negative), which is the number of cases where the model correctly classified
      negative instances.
    • FP (False Positive), which is the number of cases where the model incorrectly classified
      negative instances as positive.
    • FN (False Negative), which is the number of cases where the model incorrectly classified
      positive instances as negative.
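Equations (18) to (22) can be evaluated per genre by treating that genre as the positive class and all other genres as negative. The sketch below is a minimal illustration with a hypothetical function name. Specificity is added using its standard definition, TN / (TN + FP), which is not given in the equations above. The counts in the usage example are hypothetical: they are chosen to be consistent with a test set of 180 tracks and are not reported by the authors.

```python
def metrics(tp, tn, fp, fn):
    """Equations (18)-(22), plus specificity (standard definition, added here)."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "loss": (fp + fn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }


# Hypothetical one-vs-rest counts for a single genre (180 test tracks in total).
result = metrics(tp=25, tn=152, fp=1, fn=2)
for name, value in result.items():
    print(name, round(value, 3))
```

Accuracy and loss always sum to one, so reporting both adds no information; precision, recall and specificity are more informative when a single genre makes up only a small share of the test set.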

                               Metric                 Classical   Disco   Hiphop   Metal   Blues   Country
                               Accuracy                 0.983     0.878   0.939    0.950   0.950   0.911
                               Sensitivity (Recall)     0.926     0.767   0.968    0.865   0.667   0.607
                               Precision                0.962     0.605   0.750    0.889   1.000   0.773
                               F1                       0.943     0.677   0.845    0.877   0.800   0.680
                               Specificity              0.993     0.900   0.933    0.972   1.000   0.967

Table 2
Performance evaluation metrics for 6 genres with KNN.


                               Metric                 Classical   Disco   Hiphop   Metal   Blues   Country
                               Accuracy                 0.839     0.767   0.867    0.850   0.772   0.639
                               Sensitivity (Recall)     0.185     0.367   0.258    0.594   0.074   0.643
                               Precision                0.417     0.323   0.889    0.647   0.111   0.247
                               F1                       0.256     0.344   0.400    0.620   0.089   0.356
                               Specificity              0.954     0.847   0.993    0.916   0.895   0.638

Table 3
Performance evaluation metrics for 6 genres with Naive Bayes .

   As Tables 2 and 3 show, the metric values are good for KNN with 6 classes,
whereas Naive Bayes performs significantly worse. In terms of accuracy for the
entire test set, KNN achieved 80.56%, while the Naive Bayes classifier achieved 36.67%.
Figures 6 and 7 show the confusion matrices. An ideal confusion matrix has 100% on the diagonal
and 0% elsewhere. For KNN, the matrix is close to this ideal; the classifier performed worst for
the disco genre. For Naive Bayes, however, the confusion matrix does not resemble the
ideal one; the algorithm performed best for the metal genre.

4. Conclusion
The MFCC algorithm allows for effective classification of music genres with the KNN
classifier. The covariance matrix effectively extracted features from the audio signal and could be
used for commercial purposes, for example in medicine. As for the classifiers, KNN uses
distance metrics that can be very effective in measuring similarities between musical pieces.
Additionally, it does not assume any specific form of the classification function, relying instead on
local similarities, which is why it worked well here. On the other hand, the advantages of
the Naive Bayes classifier are its simplicity and speed compared to KNN. Naive Bayes
assumes that the features are independent, which is rarely true for audio data, where different
features can be strongly correlated. In this project, only one feature was used, namely the mean
of the sum of all elements of the matrix, which may have contributed to the low accuracy compared
to KNN. To achieve higher accuracy, more advanced methods such as neural networks (for example CNN
and RNN) should be used. In the future, a spectrogram can be created based on the matrix obtained
from the MFCC algorithm, and the given algorithm can be tested on more complex models to
achieve better results.

Figure 6: Confusion matrix for the KNN

Figure 7: Confusion matrix for the naive Bayes classifier

References
 [1] Malgorzata Przedpelska-Bieniek, “Dzwiek i akustyka. Nauka o dzwieku,” 2011.
 [2] Marcin Woźniak, Jakub Siłka, Michał Wieczorek, “Deep neural network correlation learning
     mechanism for CT brain tumor detection,” 2021.
 [3] Junxin Chen, Zhihuan Guo, Xu Xu, Li-bo Zhang, Yue Teng, Yongyong Chen, Marcin
     Woźniak, Wei Wang, “A Robust Deep Learning Framework Based on Spectrograms for
     Heart Sound Classification,” 2023.
 [4] Yogesh Kumar, Apeksha Koul, Kamini, Marcin Woźniak, Jana Shafi, Muhammad Fazal Ijaz,
     “Automated detection and recognition system for chewable food items using advanced
     deep learning models,” 2024.
 [5] Bartosz A. Nowak, Robert K. Nowicki, Marcin Woźniak, Christian Napoli, “Multi-class
     Nearest Neighbour Classifier for Incomplete Data Handling.”
 [6] I. Rish, “An empirical study of the naive Bayes classifier,” 2001.
 [7] Harry Zhang, “The Optimality of Naive Bayes,” 2004.
 [8] Sachin Chachada, C.-C. Jay Kuo, “Environmental sound recognition: a survey,” 2014.
 [9] Michael Cowling, Renate Sitte, “Comparison of techniques for environmental sound
     recognition,” 2003.
[10] Roneel V. Sharan, Tom J. Moir, “An overview of applications and advancements in automatic
     sound recognition,” 2016.
[11] Jia-Ching Wang, Hsiao-Ping Lee, Jhing-Fa Wang, Cai-Bei Lin, “Robust Environmental
     Sound Recognition for Home Automation,” 2008.
[12] Junxin Chen, Wei Wang, Bo Fang, Yu Liu, Keping Yu, Victor C. M. Leung, Xiping Hu,
     “Exploiting Smartphone Voice Recording as a Digital Biomarker for Parkinson’s Disease
     Diagnosis,” 2023.
[13] Marcin Woźniak, Dawid Połap, “Intelligent Home Systems for Ubiquitous User Support by
     Using Neural Networks and Rule-Based Approach,” 2020.
[14] Richard Hauxwell-Baldwin, Charlie Wilson, Tom Hargreaves, “Learning to live in a smart
     home,” 2017.
[15] Jessamyn Dahmen, Brian L. Thomas, Diane J. Cook, Xiaobo Wang, “Activity Learning as a
     Foundation for Security Monitoring in Smart Homes,” 2017.