<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Music Genre Recognition *</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Adam</forename><surname>Grelewicz</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44-100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Mateusz</forename><surname>Lis</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44-100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Dawid</forename><surname>Michalak</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Mathematics</orgName>
								<orgName type="institution">Silesian University of Technology</orgName>
								<address>
									<addrLine>Kaszubska 23</addrLine>
									<postCode>44-100</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">POLAND</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">Information Society</orgName>
								<orgName type="institution">University Studies</orgName>
								<address>
									<addrLine>2024, May 17</addrLine>
									<settlement>Kaunas</settlement>
									<country key="LT">Lithuania</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Music Genre Recognition *</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">1FD665BA4A3DF8D09310AB4DAE1972D0</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:28+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>MFCC algorithm</term>
					<term>Genre Recognition</term>
					<term>Naive Bayes</term>
					<term>KNN</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Sound analysis, combined with artificial intelligence models, plays a crucial role in identifying various types of defects. To design a well-functioning model, thorough data analysis is essential. Therefore, this article presents an implementation of the MFCC algorithm for different music genres. The algorithm is supported by high-pass and triangular filters, and the recordings are transformed using the discrete Fourier transform (DFT). The correctness of the algorithm is then verified with the KNN and Naive Bayes classifiers by checking whether the music genre is identified correctly. The project was conducted on a publicly available dataset. The results for the KNN classifier are very satisfactory. Additionally, this article demonstrates the superiority of the KNN classifier over Naive Bayes for sound analysis.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Sound is a wave that arises from changes in atmospheric pressure caused by vibration <ref type="bibr" target="#b0">[1]</ref>. Combined with artificial intelligence systems, it can have broad applications in various fields. In medicine, image recognition using deep learning is utilized. In <ref type="bibr" target="#b1">[2]</ref>, models are used to help specialists diagnose diseases more quickly. It is worth noting that sound carries a lot of information, and certain abnormalities can be detected from it. In <ref type="bibr" target="#b2">[3]</ref>, the use of heart sounds for early disease detection is convincingly demonstrated, allowing for earlier treatment. Article <ref type="bibr" target="#b11">[12]</ref> presents another medical application, namely recognizing people with Parkinson's disease from recorded voice samples; the average accuracy of this method is around 90%.</p><p>To apply image-recognition models to audio, sound should be converted into a spectrogram. A spectrogram is a visual representation of the intensity of a signal over time, with respect to the different frequencies present in a given waveform. The evaluation of spectrograms involves transforming the signal from the time domain to the frequency domain using the Fourier transform <ref type="bibr" target="#b3">[4]</ref>. In <ref type="bibr" target="#b3">[4]</ref>, it is shown that sound can also be used in the food industry to identify various food products. In the articles <ref type="bibr" target="#b7">[8]</ref> <ref type="bibr" target="#b8">[9]</ref>[10] <ref type="bibr" target="#b10">[11]</ref>, various techniques utilizing sound recognition are described, such as Environmental Sound Recognition (ESR) and Automatic Sound Recognition (ASR), which can be used in a smart home. 
A smart home, along with artificial intelligence methods, can provide support for people, reduce operating costs, and improve energy efficiency <ref type="bibr" target="#b12">[13]</ref> <ref type="bibr" target="#b13">[14]</ref>. Therefore, this field also utilizes sound recognition. This mechanism can be used as one of the biometric security measures for homes <ref type="bibr" target="#b14">[15]</ref>; the sound processing scheme remains the same.</p><p>All these articles demonstrate that data analysis is very important for the application of neural networks. In particular, sound must be properly processed. Sound, especially human speech or music, has certain features that can be used for its characterization, such as a unique human voice, communication-specific noise, or the use of similar instruments in musical pieces of the same genre <ref type="bibr" target="#b0">[1]</ref>. Therefore, to extract the most important features of a sound signal in the form of a coefficient matrix, the MFCC algorithm is used; it is described in detail in this article. Later, the article compares two classifiers: KNN and Naive Bayes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">The MFCC Algorithm (Mel-frequency cepstral coefficients)</head><p>Before describing the MFCC algorithm itself, certain concepts need to be defined:</p><p>1. Mel scale -a scale of pitches that measures the perceived frequency of sound, in contrast to the objective frequency scale measured in hertz.</p><p>The function for converting a frequency in hertz to the Mel scale:</p><formula xml:id="formula_0">𝑚(𝑓 ) = 1125 log (1 + 𝑓 /700) <label>(1)</label></formula><p>The inverse function:</p><formula xml:id="formula_2">𝑓 (𝑚) = 700(exp (𝑚/1125) − 1) <label>(2)</label></formula><p>2. A window function is a function that takes non-zero values only within a specified interval. Such functions are used to filter signals. For the purposes of the mathematical description, let's introduce the following notation: Description of the MFCC algorithm:</p><p>1. Let:</p><p>• 𝑛 be the number of samples in the input signal,</p><p>• 𝑋 = {𝑥 0 , 𝑥 1 , ..., 𝑥 𝑛−1 } * be the input signal,</p><p>• ℎ be the sampling frequency of the input signal in hertz,</p><p>• 𝑡 be the number of triangular filters,</p><p>• 𝑢 be the number of points of the discrete Fourier transform,</p><p>• 𝑙 be the length of the window in samples,</p><p>• 𝑠 be the number of samples by which the window is shifted, with 0 &lt; 𝑠 ≤ 𝑙,</p><p>• 𝑐 be the number of cepstral coefficients.</p><p>2. Filters are applied to remove noise. This step is optional, but noise removal improves accuracy, so a high-pass filter is used in the form:</p><formula xml:id="formula_3">𝑦 𝑘 = 𝑥 𝑘 − 0.97𝑥 𝑘+1 , 𝑘 ∈ {0, 1, 2, ..., 𝑛 − 2} *<label>(3)</label></formula><p>3. Triangular filters are created to extract desired features from the input signal while omitting unnecessary ones. These filters are distributed on a frequency scale between 0 and ℎ/2. 
Initially, the boundaries of this scale are converted from frequency in hertz to the Mel scale.</p><formula xml:id="formula_4">𝑎 = 𝑚(0) = 0, 𝑏 = 𝑚(ℎ/2) = 1125 log (1 + ℎ/1400) <label>(4)</label></formula><p>An array of 𝑡 + 2 numbers is created, evenly distributed between 𝑎 and 𝑏.</p><formula xml:id="formula_6">𝑈 = {𝑎 + 𝑘Δ𝑥 : 𝑘 ∈ {0, 1, 2, ..., 𝑡 + 1} * } * , Δ𝑥 = (𝑏 − 𝑎)/(𝑡 + 1) <label>(5)</label></formula><p>Another array is created containing the scaled elements of array 𝑈 , converted from the Mel scale to the frequency scale in hertz and rounded down.</p><formula xml:id="formula_7">𝐵 = {⌊(𝑢 + 1)𝑓 (𝑥)/ℎ⌋ : 𝑥 ∈ 𝑈 } * <label>(6)</label></formula><p>Array 𝐵 contains non-linearly distributed numbers from 0 to ⌊𝑢/2⌋.</p><p>Another array is created to contain the triangular filters mentioned at the beginning.</p><formula xml:id="formula_9">𝐹 = {{𝑔(𝑘, 𝑗) : 𝑗 ∈ {0, 1, 2, ..., ⌊𝑢/2⌋} * } * : 𝑘 ∈ {0, 1, 2, ..., 𝑡 − 1} * } * <label>(7)</label></formula><p>where:</p><formula xml:id="formula_10">𝑔(𝑘, 𝑗) = (𝑗 − 𝐵[𝑘])/(𝐵[𝑘 + 1] − 𝐵[𝑘]) for 𝐵[𝑘] ≤ 𝑗 &lt; 𝐵[𝑘 + 1]; (𝐵[𝑘 + 2] − 𝑗)/(𝐵[𝑘 + 2] − 𝐵[𝑘 + 1]) for 𝐵[𝑘 + 1] ≤ 𝑗 &lt; 𝐵[𝑘 + 2]; 0 for other 𝑗</formula><p>4. The input signal is divided into windows, where a window is defined as:</p><formula xml:id="formula_11">𝑊 (𝑖) = {𝑝(𝑘) : 𝑠𝑖 ≤ 𝑘 &lt; 𝑠𝑖 + 𝑙} * , 𝑖 ∈ {0, 1, 2, ..., ⌊(𝑛 + 𝑠 − 𝑛 mod 𝑠)/𝑠⌋ − 1} *<label>(8)</label></formula><p>where 𝑖 is the window index and</p><formula xml:id="formula_12">𝑝(𝑘) = 𝑥 𝑘 for 0 ≤ 𝑘 &lt; 𝑛; 0 for 𝑛 ≤ 𝑘</formula><p>The power spectrum is calculated, i.e., the discrete Fourier transform (DFT) of the first 𝑢 elements from the array 𝑊 (𝑖); then each number in the resulting array is squared in absolute value and the elements are scaled by 1/𝑙.</p><formula xml:id="formula_13">𝑃 (𝑖) = (1/𝑙) 𝑆(DFT(𝑊 (𝑖), 𝑢))<label>(9)</label></formula><p>where 𝑃 (𝑖) denotes the power spectrum for the i-th window 𝑊 (𝑖) and</p><formula xml:id="formula_14">𝑆({𝑠 0 , 𝑠 1 , ..., 𝑠 𝑛 } * ) = {|𝑠 0 | 2 , |𝑠 1 | 2 , ..., |𝑠 𝑛 | 2 } *</formula><p>Absolute values are required in function 𝑆 because the 𝑠 𝑘 can be complex numbers.</p><p>The previously calculated filters are then utilized to filter the power spectrum via the matrix product of 𝑃 (𝑖) and the transpose of matrix 𝐹 .</p><formula xml:id="formula_15">𝐶(𝑖) = 𝑃 (𝑖)𝐹 ⊤ <label>(10)</label></formula><p>The final step is to compute the natural logarithm of each element of 𝐶(𝑖) and transform these logarithms using the discrete cosine transform (DCT) of type II, keeping the first 𝑐 elements.</p><formula xml:id="formula_16">𝑅(𝑖) = DCT(𝐿(𝐶(𝑖)))[: 𝑐]<label>(11)</label></formula><p>where 𝐿({𝑥 0 , 𝑥 1 , ..., 𝑥 𝑛 } * ) = {ln 𝑥 0 , ln 𝑥 1 , ..., ln 𝑥 𝑛 } * .</p><p>The result of the algorithm is the matrix:</p><formula xml:id="formula_17">𝑅 = {𝑅(0), 𝑅(1), 𝑅(2), ..., 𝑅(⌊(𝑛 + 𝑠 − 𝑛 mod 𝑠)/𝑠⌋ − 1)} * <label>(12)</label></formula></div>
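The steps above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the parameter defaults (t = 26, l = 551, s = 220, c = 13 come from the worked example later in the paper; u = 512 is assumed from ⌊u/2⌋ = 256), and the small epsilon before the logarithm and the zero-padding of the last window are added implementation choices.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    # Eq. (1): frequency in Hz -> Mel scale
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    # Eq. (2): inverse mapping, Mel -> Hz
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc(x, h, t=26, u=512, l=551, s=220, c=13):
    """Sketch of the MFCC steps: h = sample rate, t = filters, u = DFT
    size, l = window length, s = hop, c = cepstral coefficients."""
    # Step 2: high-pass (pre-emphasis) filter, Eq. (3)
    y = x[:-1] - 0.97 * x[1:]
    n = len(y)
    # Step 3: t + 2 points evenly spaced on the Mel scale, Eqs. (4)-(6)
    mel_pts = np.linspace(0.0, hz_to_mel(h / 2.0), t + 2)
    B = np.floor((u + 1) * mel_to_hz(mel_pts) / h).astype(int)
    # Triangular filterbank, Eq. (7)
    F = np.zeros((t, u // 2 + 1))
    for k in range(t):
        for j in range(B[k], B[k + 1]):          # rising edge
            F[k, j] = (j - B[k]) / (B[k + 1] - B[k])
        for j in range(B[k + 1], B[k + 2]):      # falling edge
            F[k, j] = (B[k + 2] - j) / (B[k + 2] - B[k + 1])
    # Step 4: windowing, Eq. (8); step 5 onward inside the loop
    num_win = (n + s - n % s) // s
    R = []
    for i in range(num_win):
        w = y[i * s : i * s + l]
        if len(w) < l:                           # zero-pad the last window
            w = np.pad(w, (0, l - len(w)))
        # Power spectrum of the first u samples, Eq. (9)
        P = (np.abs(np.fft.rfft(w, u)) ** 2) / l
        C = P @ F.T                              # apply filterbank, Eq. (10)
        # Log + type-II DCT, keep first c coefficients, Eq. (11)
        R.append(dct(np.log(C + 1e-12), type=2)[:c])
    return np.array(R)                           # the matrix R, Eq. (12)
```

Running this on a 30 s clip at h = 22050 Hz yields one row of c cepstral coefficients per window, matching the matrix of Eq. (12).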
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">KNN (k-Nearest-Neighbours)</head><p>The K-Nearest Neighbors (KNN) algorithm is a classification and regression method that utilizes the similarity between data points. It operates by finding the nearest neighbors (data points) to a new point and uses their information to predict the class or value for that point <ref type="bibr" target="#b4">[5]</ref>. Before describing the KNN algorithm itself, certain concepts need to be defined: Value of k -the number of neighbors to be considered during classification or regression. Mahalanobis distance -it accounts for the correlations between two vectors x and y with covariance matrix S and scales distances depending on the distribution of the data. It is given by the formula:</p><formula xml:id="formula_18">𝐷 𝑀 (𝑥, 𝑦) = √︁ (𝑥 − 𝑦) ⊤ 𝑆 −1 (𝑥 − 𝑦)<label>(13)</label></formula></div>
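A minimal sketch of KNN classification with the Mahalanobis distance of Eq. (13). The covariance matrix is estimated from the training data; using the pseudo-inverse is an added safeguard (not in the text) in case the covariance is singular.

```python
import numpy as np
from collections import Counter

def mahalanobis(x, y, S_inv):
    # Eq. (13): D(x, y) = sqrt((x - y)^T S^{-1} (x - y))
    d = x - y
    return float(np.sqrt(d @ S_inv @ d))

def knn_predict(X_train, y_train, x_new, k=5):
    # Estimate the covariance matrix S of the training features;
    # pinv is an added numerical safeguard for singular S.
    S = np.atleast_2d(np.cov(X_train, rowvar=False))
    S_inv = np.linalg.pinv(S)
    # Distance from the new point to every training point
    dists = [mahalanobis(x, x_new, S_inv) for x in X_train]
    # Indices of the k nearest neighbours, then a majority vote
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

The default k = 5 reflects the value the experiments later find most effective.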
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Naive Bayes classifier</head><p>The Naive Bayes classifier is a machine learning method used for classifying data into decision classes.</p><formula xml:id="formula_19">𝑑 𝑁𝐵 = arg max 𝑑𝑖∈𝑉𝐷 𝑃 (𝑑 𝑖 ) • ∏ 𝑛 𝑘=1 𝑃 (𝑥 𝑗𝑘 | 𝑑 𝑖 )<label>(16)</label></formula></div>
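The decision rule of Eq. (16) can be sketched for discrete attributes as follows, with the priors and conditional probabilities estimated as frequency ratios. The Laplace (+1) smoothing and the log-space product are added safeguards against zero probabilities and underflow, not part of the formula itself.

```python
import math
from collections import Counter

def naive_bayes_predict(X_train, y_train, x_new):
    # Choose the class d maximizing P(d) * prod_k P(x_k | d), Eq. (16)
    classes = Counter(y_train)
    total = len(y_train)
    best_class, best_score = None, float("-inf")
    for d, count in classes.items():
        rows = [x for x, y in zip(X_train, y_train) if y == d]
        score = math.log(count / total)          # prior P(d_i)
        for k, v in enumerate(x_new):
            # P(x_k = v | d): frequency among class-d examples,
            # with Laplace (+1) smoothing to avoid log(0)
            matches = sum(1 for r in rows if r[k] == v)
            score += math.log((matches + 1) / (count + 2))
        if score > best_score:
            best_class, best_score = d, score
    return best_class
```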
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The example of the MFCC algorithm will be conducted using the file classical.00000.wav.</p><p>2. After applying the high-pass filter, the samples look as shown on Figure <ref type="figure" target="#fig_3">3</ref>.</p><p>3. The array 𝐵 looks as follows:</p><formula xml:id="formula_20">𝐵 = {0, 1, 3, ...} *</formula><p>As can be seen, the array 𝐵 contains numbers ranging from 0 to ⌊𝑢/2⌋ = 256. These numbers are not evenly distributed, meaning the differences between consecutive numbers increase as the elements progress. This is because the boundary frequencies 𝑎 and 𝑏 were converted from the frequency scale in hertz to the Mel scale, which is nonlinear. The reason why a change to a nonlinear scale was required will be explained later in the example.</p><p>Triangular filters are created, again, according to the formula from the description:</p><formula xml:id="formula_22">𝐹 = {{𝑔(𝑘, 𝑗) : 𝑗 ∈ {0, 1, 2, ..., 256} * } * : 𝑘 ∈ {0, 1, 2, ..., 25} * } *</formula><p>For 𝑘 = 0 the function 𝑔(0, 𝑗) looks like:</p><formula xml:id="formula_21">𝑔(0, 𝑗) = 𝑗 for 0 ≤ 𝑗 &lt; 1; (3 − 𝑗)/2 for 1 ≤ 𝑗 &lt; 3; 0 for other 𝑗</formula><p>After calculating all the values 𝑔(𝑘, 𝑗), all 𝑡 = 26 filters can be represented on a graph on figure <ref type="figure" target="#fig_6">5</ref>. Due to the application of the Mel scale to distribute these filters, the highest density is at the beginning and the lowest at the end. The reason for needing the Mel scale is that it accurately represents how humans perceive sound. It turns out that most useful information is in the lower frequencies, not the higher ones. Therefore, it makes sense to place more filters at the beginning, which was achieved by converting the frequency scale in hertz to the Mel scale. 
Without this, all filters would be evenly distributed across the entire scale.</p><formula xml:id="formula_23">𝑔(0, 𝑗) = (𝑗 − 𝐵[0])/(𝐵[1] − 𝐵[0]) for 𝐵[0] ≤ 𝑗 &lt; 𝐵[1]; (𝐵[2] − 𝑗)/(𝐵[2] − 𝐵[1]) for 𝐵[1] ≤ 𝑗 &lt; 𝐵[2]; 0 for other 𝑗 ⇒ 𝑔(0, 𝑗) = 𝑗 for 0 ≤ 𝑗 &lt; 1; (3 − 𝑗)/2 for 1 ≤ 𝑗 &lt; 3; 0 for other 𝑗</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This is what would happen if evenly distributed filters were used; it was tested and degraded the accuracy by about 5%. The input signal is divided into windows.</p><formula xml:id="formula_24">𝑊 (𝑖) = {𝑝(𝑘) : 220𝑖 ≤ 𝑘 &lt; 220𝑖 + 551} * , 𝑖 ∈ {0, 1, ..., 3008} * , 𝑝(𝑘) = 𝑥 𝑘 for 0 ≤ 𝑘 &lt; 661794; 0 for 661794 ≤ 𝑘</formula><p>For the window 𝑖 = 0:</p><formula xml:id="formula_25">𝑊 (0) = {𝑝(𝑘) : 0 ≤ 𝑘 &lt; 551} * ⇒ 𝑊 (0) = {𝑥 0 , 𝑥 1 , 𝑥 2 , ..., 𝑥 550 } *</formula><p>Having the power spectral density, the matrix product of 𝑃 (𝑖) and 𝐹 ⊤ is calculated, which performs the filtering of frequencies according to the triangular filters previously established.</p><formula xml:id="formula_26">𝐶(𝑖) = 𝑃 (𝑖)𝐹 ⊤ (17)</formula><p>For 𝑖 = 0 we get:</p><formula xml:id="formula_27">𝐶(0) = 𝑃 (0)𝐹 ⊤ ⇒ 𝐶(0) = {10695.2993, 31658.4727, 20555.0554, ..., 18245.6147} *</formula><p>Analysis of the results is conducted for 6 classes of abstraction, with the following music genres:</p><formula xml:id="formula_28">• Classical music, • Disco, • Hip-hop, • Metal, • Blues, • Country.</formula><p>For each genre, there are 100 assigned tracks, each lasting 30 seconds. The split between training and test data is 70:30.</p><p>Before conducting a detailed analysis, it is important to determine the most effective value of k for the KNN classifier. According to Table <ref type="table" target="#tab_3">1</ref>, the most effective value is k = 5, therefore this value is adopted for the analysis. The next step is to evaluate the obtained matrices with the KNN classifier and Naive Bayes. Performance evaluation metrics such as accuracy, loss, precision, recall, and F1 score will be used to assess the effectiveness of these methods <ref type="bibr" target="#b3">[4]</ref>. 
These metrics are essential for evaluating the performance of machine learning models and are described by the following equations: </p><p>• TP (True Positive), which is the number of cases where the model correctly classified positive instances.</p><p>As can be seen from Tables <ref type="table" target="#tab_5">2</ref> and 3, the metric values are very good for KNN with 6 classes of abstraction, whereas Naive Bayes performs significantly worse. In terms of accuracy on the entire test set, KNN achieved 80.56%, while the naive Bayes classifier achieved 36.67%. Figures <ref type="figure" target="#fig_10">6</ref> and 7 show the confusion matrices. An ideal confusion matrix has 100% on the diagonal, and the rest should be 0%. For KNN, the matrix is nearly ideal; the classifier performed worst for the disco music genre. For Naive Bayes, the confusion matrix does not resemble the ideal one. Nevertheless, that algorithm performed best for the metal genre.</p></div>
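The listed metrics can be computed from the four counts (TP, TN, FP, FN) as in this short sketch; the formulas follow the standard definitions used in the paper's tables, and the zero-denominator guards are an added safety measure.

```python
def classification_metrics(tp, tn, fp, fn):
    """Per-class metrics built from the TP/TN/FP/FN counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    loss = (fp + fn) / total                      # = 1 - accuracy
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"accuracy": accuracy, "loss": loss, "precision": precision,
            "recall": recall, "f1": f1, "specificity": specificity}
```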
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>The MFCC algorithm allows for highly efficient classification of music genres with the KNN classifier. The matrix of cepstral coefficients effectively extracted features from the audio signal and could be used in commercial applications, for example in medicine. In the case of classifiers, KNN uses distance metrics that can be very effective in measuring similarities between musical pieces. Additionally, it does not assume any specific form of the classification function, relying instead on local similarities, which is why it worked well here. On the other hand, the advantages of the naive Bayes classifier are its simplicity and speed compared to KNN. Naive Bayes assumes that the features are independent, which is rarely true for audio data, where different features can be strongly correlated. In this project, only one feature was used, namely the mean of the sum of all elements of the matrix, which may have influenced the low accuracy compared to KNN. To achieve higher accuracy, more advanced methods such as convolutional or recurrent neural networks (CNN, RNN) should be used. In the future, based on the matrix obtained from the MFCC algorithm, a spectrogram can be created and the given algorithm can be tested on more complex models to achieve better results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: An example of a window function</figDesc><graphic coords="2,195.20,333.95,192.80,161.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>1 .</head><label>1</label><figDesc>The sampling frequency for this file is ℎ = 22050 Hz. The length of the file is 30.013 s. The number of samples is 𝑛 = 661794. The sound samples are 𝑋 = {𝑥 0 , 𝑥 1 , ..., 𝑥 𝑛−1 } * . The plot of the samples of this file is on Figure 2</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Representation of samples in audio.</figDesc><graphic coords="6,185.30,137.00,202.70,162.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>The sample after it was filtered.</figDesc><graphic coords="6,185.30,445.90,202.70,162.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head></head><label></label><figDesc>The array {𝑔(0, 𝑗) : 𝑗 ∈ {0, 1, 2, ..., 256} * } * (the first element of 𝐹 ) is represented on a graph on figure 4.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Representation of a triangular filter for that sample</figDesc><graphic coords="8,208.00,446.35,168.20,140.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Representation of the triangular filterbank</figDesc><graphic coords="9,208.00,93.95,168.20,132.85" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head></head><label></label><figDesc>For the window 𝑖 = 0: 𝑊 (0) = {−102.19, −705.89, −136.54, ..., 123.9} * For the window 𝑖 = 1: 𝑊 (1) = {𝑝(𝑘) : 220 ≤ 𝑘 &lt; 771} * ⇒ 𝑊 (1) = {𝑥 220 , 𝑥 221 , 𝑥 222 , ..., 𝑥 770 } * For the window 𝑖 = 3008 (the last window): 𝑊 (3008) = {𝑝(𝑘) : 661760 ≤ 𝑘 &lt; 662311} * ⇒ 𝑊 (3008) = {𝑥 661760 , 𝑥 661761 , 𝑥 661762 , ..., 𝑥 661793 , 0, 0, ..., 0} * It is worth noting that for 𝑖 = 3008 the index 𝑘 goes beyond the input signal, so there are zeros at the end. Now the power spectral density is calculated: 𝑃 (0) = (1/𝑙) 𝑆(DFT(𝑊 (0), 𝑢)) ⇒ 𝑃 (0) = {476.280941, 6620.61200, 8149.37465, ..., 195.371343} *</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head></head><label></label><figDesc>𝐶(0) = {10695.2993, 31658.4727, 20555.0554, ..., 18245.6147} * Finally, the discrete cosine transform of the logarithms of 𝐶(𝑖) is calculated, taking only the first 𝑐 = 13 elements: 𝑅(𝑖) = DCT(𝐿(𝐶(𝑖)))[: 13] For 𝑖 = 0 we get: 𝑅(0) = DCT(𝐿(𝐶(0)))[: 13] ⇒ 𝑅(0) = DCT(𝐿({10695.2993, 31658.4727, 20555.0554, ..., 18245.6147} * ))[: 13] ⇒ 𝑅(0) = {62.5650537, −2.03586229, −5.32321543, ..., −1.89111160} * Finally, the array 𝑅 is obtained, which is: 𝑅 = {𝑅(0), 𝑅(1), 𝑅(2), ..., 𝑅(3008)} * Thus, the input signal is represented in the form of a matrix of cepstral coefficients.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head></head><label></label><figDesc>Accuracy = (𝑇 𝑃 + 𝑇 𝑁 )/(𝑇 𝑃 + 𝑇 𝑁 + 𝐹 𝑃 + 𝐹 𝑁 ), Loss = (𝐹 𝑃 + 𝐹 𝑁 )/(𝑇 𝑃 + 𝑇 𝑁 + 𝐹 𝑃 + 𝐹 𝑁 )</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Confusion matrix for the KNN</figDesc><graphic coords="13,209.55,101.10,157.10,125.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_11"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Confusion matrix for the naive Bayes classifier</figDesc><graphic coords="13,209.60,288.05,161.10,125.75" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head></head><label></label><figDesc>1. {...} * denotes an array, which is a set where elements can repeat and maintain order. An array of arrays is called a matrix. 2. If 𝐴 is an array, the notation 𝐴[𝑘] means the k-th element of 𝐴. 3. If 𝐴 is an array, the notation 𝐴[: 𝑘] means the first k elements of 𝐴.</figDesc><table><row><cell>4. All other operations on arrays work similarly to operations on sets.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head></head><label></label><figDesc>Despite its simplicity, it has a wide range of applications in text classification, medical diagnosis, and system performance management. The task of the Bayes classifier is to assign a new case to one of the classes [6][7]. Each training example is described by a set of conditional attributes {𝑋 𝑖 } and one decision attribute 𝐷. According to Bayes' theorem, the most probable class to which a new object, described by the values of n conditional attributes ⟨𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 ⟩, belongs is the class 𝑑 𝑖 that maximizes the conditional probability 𝑃 (𝑑 𝑖 | 𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 ).</figDesc><table /><note>𝑑 = arg max 𝑑𝑖∈𝑉𝐷 𝑃 (𝑑 𝑖 ) • 𝑃 (𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 | 𝑑 𝑖 ) (14) The probability 𝑃 (𝑑 𝑖 ) can be estimated as the ratio of the number of training examples belonging to class 𝑑 𝑖 to the total number of training examples. To estimate 𝑃 (𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 | 𝑑 𝑖 ), the Naive Bayes classifier assumes the conditional independence of attributes: 𝑃 (𝑥 𝑗1 , 𝑥 𝑗2 , . . . , 𝑥 𝑗𝑛 | 𝑑 𝑖 ) = ∏ 𝑛 𝑘=1 𝑃 (𝑥 𝑗𝑘 | 𝑑 𝑖 ) (15) The probability 𝑃 (𝑥 𝑗𝑘 | 𝑑 𝑖 ) can be estimated as the ratio of the number of training examples in class 𝑑 𝑖 for which the attribute 𝑋 𝑘 has the value 𝑥 𝑗𝑘 to the total number of training examples in class 𝑑 𝑖 . Considering this assumption, the class 𝑑 𝑁𝐵 (Naïve Bayes) chosen for a new example is given by Eq. (16).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 1</head><label>1</label><figDesc>Accuracy of the KNN algorithm depending on the number of neighbors.</figDesc><table><row><cell>𝑘</cell><cell>Accuracy</cell></row><row><cell>3</cell><cell>77.78%</cell></row><row><cell>4</cell><cell>78.89%</cell></row><row><cell>5</cell><cell>80.56%</cell></row><row><cell>6</cell><cell>77.78%</cell></row><row><cell>7</cell><cell>77.22%</cell></row><row><cell>8</cell><cell>77.78%</cell></row><row><cell>9</cell><cell>76.67%</cell></row><row><cell>10</cell><cell>75.56%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>• TN (True Negative), which is the number of cases where the model correctly classified negative instances. • FP (False Positive), which is the number of cases where the model incorrectly classified negative instances as positive. • FN (False Negative), which is the number of cases where the model incorrectly classified positive instances as negative.</figDesc><table><row><cell></cell><cell cols="6">Classical Disco Hiphop Metal Blues Country</cell></row><row><cell>Accuracy</cell><cell>0.983</cell><cell>0.878</cell><cell>0.939</cell><cell>0.950</cell><cell>0.950</cell><cell>0.911</cell></row><row><cell>Sen-Rec</cell><cell>0.926</cell><cell>0.767</cell><cell>0.968</cell><cell>0.865</cell><cell>0.667</cell><cell>0.607</cell></row><row><cell>Precision</cell><cell>0.962</cell><cell>0.605</cell><cell>0.750</cell><cell>0.889</cell><cell>1.000</cell><cell>0.773</cell></row><row><cell>F1</cell><cell>0.943</cell><cell>0.677</cell><cell>0.845</cell><cell>0.877</cell><cell>0.800</cell><cell>0.680</cell></row><row><cell>Specificity</cell><cell>0.993</cell><cell>0.900</cell><cell>0.933</cell><cell>0.972</cell><cell>1.000</cell><cell>0.967</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 2</head><label>2</label><figDesc>Performance evaluation metrics for 6 genres with KNN.</figDesc><table><row><cell></cell><cell cols="6">Classical Disco Hiphop Metal Blues Country</cell></row><row><cell>Accuracy</cell><cell>0.839</cell><cell>0.767</cell><cell>0.867</cell><cell>0.850</cell><cell>0.772</cell><cell>0.639</cell></row><row><cell>Sen-Rec</cell><cell>0.185</cell><cell>0.367</cell><cell>0.258</cell><cell>0.594</cell><cell>0.074</cell><cell>0.643</cell></row><row><cell>Precision</cell><cell>0.417</cell><cell>0.323</cell><cell>0.889</cell><cell>0.647</cell><cell>0.111</cell><cell>0.247</cell></row><row><cell>F1</cell><cell>0.256</cell><cell>0.344</cell><cell>0.400</cell><cell>0.620</cell><cell>0.089</cell><cell>0.356</cell></row><row><cell>Specificity</cell><cell>0.954</cell><cell>0.847</cell><cell>0.993</cell><cell>0.916</cell><cell>0.895</cell><cell>0.638</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 3</head><label>3</label><figDesc>Performance evaluation metrics for 6 genres with Naive Bayes .</figDesc><table /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">Malgorzata</forename><surname>Przedpelska-Bieniek</surname></persName>
		</author>
		<title level="m">Dzwiek i akustyka</title>
				<imprint>
			<publisher>Nauka o dzwieku</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Deep neural network correlation learning mechanism for CT brain tumor detection</title>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jakub</forename><surname>Siłka</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michał</forename><surname>Wieczorek</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">A Robust Deep Learning Framework Based on Spectrograms for Heart Sound Classification</title>
		<author>
			<persName><forename type="first">Junxin</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Zhihuan</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xu</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Li-Bo</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yue</forename><surname>Teng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yongyong</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Wang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Automated detection and recognition system for chewable food items using advanced deep learning models</title>
		<author>
			<persName><forename type="first">Yogesh</forename><surname>Kumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Apeksha</forename><surname>Koul</surname></persName>
		</author>
		<author>
			<persName><surname>Kamini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jana</forename><surname>Shafi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Muhammad</forename><forename type="middle">Fazal</forename><surname>Ijaz</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Multi-class Nearest Neighbour Classifier for Incomplete Data Handling</title>
		<author>
			<persName><forename type="first">Bartosz</forename><forename type="middle">A</forename><surname>Nowak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><forename type="middle">K</forename><surname>Nowicki</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Christian</forename><surname>Napoli</surname></persName>
		</author>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">An empirical study of the naive Bayes classifier</title>
		<author>
			<persName><forename type="first">I</forename><surname>Rish</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">The Optimality of Naive Bayes</title>
		<author>
			<persName><forename type="first">Harry</forename><surname>Zhang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Environmental sound recognition: a survey</title>
		<author>
			<persName><forename type="first">Sachin</forename><surname>Chachada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-C.</forename><forename type="middle">Jay</forename><surname>Kuo</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Comparison of techniques for environmental sound recognition</title>
		<author>
			<persName><forename type="first">Michael</forename><surname>Cowling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Renate</forename><surname>Sitte</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<title level="m" type="main">An overview of applications and advancements in automatic sound recognition</title>
		<author>
			<persName><forename type="first">Roneel</forename><forename type="middle">V</forename><surname>Sharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tom</forename><forename type="middle">J</forename><surname>Moir</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Robust Environmental Sound Recognition for Home Automation</title>
		<author>
			<persName><forename type="first">Jia-Ching</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hsiao-Ping</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jhing-Fa</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Cai-Bei</forename><surname>Lin</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Exploiting Smartphone Voice Recording as a Digital Biomarker for Parkinson&apos;s Disease Diagnosis</title>
		<author>
			<persName><forename type="first">Junxin</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wei</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bo</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yu</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Keping</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Victor</forename><forename type="middle">C M</forename><surname>Leung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiping</forename><surname>Hu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Intelligent Home Systems for Ubiquitous User Support by Using Neural Networks and Rule-Based Approach</title>
		<author>
			<persName><forename type="first">Marcin</forename><surname>Woźniak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dawid</forename><surname>Połap</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Learning to live in a smart home</title>
		<author>
			<persName><forename type="first">Tom</forename><surname>Hargreaves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Charlie</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Richard</forename><surname>Hauxwell-Baldwin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m"/>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<title level="m" type="main">Activity Learning as a Foundation for Security Monitoring in Smart Homes</title>
		<author>
			<persName><forename type="first">Jessamyn</forename><surname>Dahmen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Brian</forename><forename type="middle">L</forename><surname>Thomas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Diane</forename><forename type="middle">J</forename><surname>Cook</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Xiaobo</forename><surname>Wang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
