<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Detection of Defective Speech Using Convolutional Neural Networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Mikhail</forename><surname>Belenko</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">ITMO University</orgName>
								<address>
									<addrLine>Kronverksky Pr. 49, bldg. A</addrLine>
									<postCode>197101</postCode>
									<settlement>St. Petersburg</settlement>
									<country key="RU">Russian Federation</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Nikita</forename><surname>Burym</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">ITMO University</orgName>
								<address>
									<addrLine>Kronverksky Pr. 49, bldg. A</addrLine>
									<postCode>197101</postCode>
									<settlement>St. Petersburg</settlement>
									<country key="RU">Russian Federation</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pavel</forename><surname>Balakshin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">ITMO University</orgName>
								<address>
									<addrLine>Kronverksky Pr. 49, bldg. A</addrLine>
									<postCode>197101</postCode>
									<settlement>St. Petersburg</settlement>
									<country key="RU">Russian Federation</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Detection of Defective Speech Using Convolutional Neural Networks</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C2A7FFE9CBC51CC9D4F5B2203520907F</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Speech recognition</term>
					<term>Defective speech</term>
					<term>Convolutional Neural Network</term>
					<term>Convolutional Deep Belief Network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper presents an algorithm for detecting a pathological voice. It is shown that a convolutional neural network effectively extracts features from spectrograms of voice recordings and diagnoses voice disorders. A convolutional deep belief network helps to initialize the weights and makes the system more reliable. The effect of the size of the convolutional filters in each layer on system performance is also studied.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Automatic detection of pathological voice disorders, such as paralysis of the vocal cords or Reinke's edema, is a complex and important medical classification problem. While deep learning methods have made significant progress in speech recognition, fewer studies have addressed the detection of pathological voice disorders. This paper presents a new pathological voice recognition system that uses a convolutional neural network (CNN) as its basic architecture. The system takes spectrograms of normal and pathological speech recordings as input to the network. First, a convolutional deep belief network (CDBN) is used to pretrain the CNN weights; it acts as a generative model that learns the structure of the input data by statistical methods. The CNN weights are then fine-tuned by supervised backpropagation training. As a result, good classification performance can be achieved with a small amount of data. The performance of this method is analyzed using real data from the Saarbruecken Voice Database.</p><p>Voice pathologies affect the larynx and lead to irregular vibration of the vocal folds. This causes psychological and physiological problems for individuals and also has a significant economic impact, given the costs of medical diagnosis and treatment. The traditional method of diagnosing voice pathology relies on the experience of a doctor and on expensive devices such as a laryngoscope or endoscope. However, computer-based medical systems for diagnosing voice pathologies are becoming popular due to significant advances in signal processing technologies. 
These comprehensive tools are usually non-invasive and non-subjective, which is generally an advantage in the medical field <ref type="bibr" target="#b0">[1]</ref>.</p><p>Over the past few decades, much scientific work has been devoted to the automatic detection of voice pathologies. Usually, features are extracted from speech recordings and then processed by classifiers to distinguish normal speech from pathological speech. The features are mainly drawn from two areas of research. One is related to speech recognition applications, where signal processing tools automatically extract signal properties such as Mel-frequency cepstral coefficients (MFCC), linear predictive cepstral coefficients (LPCC), and the energy and entropy of discrete wavelet packets <ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref>.</p><p>Other features come from voice quality measures grounded in physiological and etiological studies. While pitch, jitter, and shimmer characterize the perturbation of speech, other measures such as the harmonics-to-noise ratio (HNR), normalized noise energy (NNE), laryngeal-to-noise ratio (LNR), and cepstral peak prominence (CPP) represent the hoarseness of speech <ref type="bibr" target="#b4">[5]</ref>. Most research papers use the Massachusetts Eye and Ear Infirmary (MEEI) database. However, the healthy and pathological voice recordings in this database were made in two different environments <ref type="bibr" target="#b5">[6]</ref>, which makes it difficult to tell whether it is the environments or the voice features that are being discriminated. The Saarbruecken Voice Database is a downloadable database in which all recordings are sampled at 50 kHz with 16-bit resolution. This database is relatively new, so little research has been done on it. 
However, its recordings were all made in the same environment, so it was chosen for this study.</p><p>Modern signal processing techniques previously used in speech recognition have also made significant progress in the automatic detection of abnormal voice. For example, in <ref type="bibr" target="#b6">[7]</ref>, a Gaussian Mixture Model (GMM) is applied to the Saarbruecken Voice Database, and 67% classification accuracy is achieved on the sustained vowel /a/ at neutral pitch. With the increasing computing capabilities of hardware and the improvement of machine learning algorithms, the deep neural network hidden Markov model (DNN-HMM) is gradually replacing the traditional GMM-HMM <ref type="bibr" target="#b7">[8]</ref> and has become a popular speech recognition method. To date, deep learning methods are not commonly used for pathological voice detection, mainly because of the limited amount of data, since a DNN requires a large amount of data for training. In <ref type="bibr" target="#b8">[9]</ref>, the restricted Boltzmann machine (RBM) is proposed as an unsupervised method for pretraining a DNN so that it approaches a global minimum. As a generative model, it improves deep learning performance even on small datasets. Convolutional deep belief networks (CDBNs) were proposed in <ref type="bibr" target="#b9">[10]</ref> as a structure specifically suited to pretraining a CNN. This paper considers a new deep learning method for the automatic detection of abnormal voice: a CNN automatically analyzes spectrograms of speech recordings, and a CDBN is used to pretrain the weights and prevent overfitting. A similar approach is proposed in <ref type="bibr" target="#b10">[11]</ref>, but the influence of the convolutional network parameters is not examined in that study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methodology</head><p>Figure <ref type="figure" target="#fig_1">1</ref> shows a block diagram of the proposed system for detecting abnormal voice. First, preprocessing is applied to the speech recordings, which includes resampling and resizing. Then a short-time Fourier transform (STFT) is applied to obtain spectrograms of the speech recordings as input to the CNN system. The weights of the CNN system are pretrained using a CDBN and fine-tuned by backpropagation. The trained CNN system is able to automatically extract features and classify audio samples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Input data</head><p>One of the properties of a CNN is its ability to reduce the dimensionality of two-dimensional feature maps. Therefore, the speech recordings are converted from one-dimensional signals to two-dimensional spectrograms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">Dataset</head><p>This paper uses the Saarbruecken Voice Database, which was recorded by the Institute of Phonetics of Saarland University in Germany. The database covers 71 different pathologies, with speech recordings from more than 2000 people. Each participant's file contains recordings of the sustained vowels /a/, /i/, and /u/ in neutral, low, high, and low-high-low intonations, and the continuous speech sentence "Guten Morgen, wie geht es Ihnen?" ("Good morning, how are you?"). Sustained vowels are used in this work because they are stationary in time, which makes changes easier to observe.</p><p>The following pathologies were selected as the pathological group: laryngitis, leukoplakia, Reinke's edema, paralysis of the recurrent laryngeal nerve, carcinoma of the vocal folds, and polyps of the vocal folds. All of these pathologies are organic dysphonias caused by structural changes in the vocal cords. The vowel /a/ at neutral pitch is used for each individual; 482 samples are healthy and 482 are diagnosed with pathologies (140 laryngitis, 41 leukoplakia, 68 Reinke's edema, 213 recurrent laryngeal nerve paralysis, 22 vocal fold carcinoma, and 45 vocal fold polyps). The data is divided into a training set and a test set containing 75% and 25% of the samples, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Preprocessing</head><p>The source speech is resampled to 25 kHz in the preprocessing stage. The goal of this step is to reduce the amount of data in the feature map and speed up the learning process. Then the STFT is used to convert the time-domain signal into the spectral domain. At this stage, each file is divided into 10 ms Hamming-window segments with 50% overlap between consecutive windows. Finally, each spectrogram is resized to the same size of 60*155 points to discard the part that does not contain any information. In this way, useless noise is removed and significant features stand out.</p></div>
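The STFT step described above can be sketched with a short NumPy routine. The sampling rate, 10 ms Hamming window, and 50% overlap follow the text; the synthetic test tone and the omission of the final 60*155 resizing step are simplifications of this sketch, not details from the paper.

```python
import numpy as np

def spectrogram(signal, sr=25_000, win_ms=10, overlap=0.5):
    """Magnitude spectrogram via STFT with Hamming windows."""
    win = int(sr * win_ms / 1000)          # 250 samples per 10 ms frame
    hop = int(win * (1 - overlap))         # 125-sample hop for 50% overlap
    w = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i*hop : i*hop + win] * w for i in range(n_frames)])
    # rfft gives win//2 + 1 frequency bins; transpose to (freq, time)
    return np.abs(np.fft.rfft(frames, axis=1)).T

# one second of a synthetic 220 Hz tone stands in for a voice recording
t = np.arange(25_000) / 25_000
S = spectrogram(np.sin(2 * np.pi * 220 * t))
print(S.shape)   # (126, 199)
```

Each column of `S` is one 10 ms analysis frame; in the paper this image is then cropped/resized to 60*155 points before entering the CNN.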
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">CNN architecture</head><p>A CNN is represented by an input layer and several hidden layers. Each hidden layer consists of a convolutional layer 𝐻 and a pooling layer. The input feature maps are defined as 𝑉_𝑙 (𝑙 = 1, ..., 𝐿), and the convolutional feature maps as 𝐻_𝑘 (𝑘 = 1, ..., 𝐾). The filter weights are shared by all units of a convolutional map and are computed as</p><formula xml:id="formula_0">ℎ^𝑘_𝑚 = 𝜎(∑_{𝑙=1}^{𝐿} ∑_{𝑛=1}^{𝑁_𝑊} 𝑣_{𝑙,𝑛+𝑚−1} 𝑤^𝑘_{𝑙,𝑛} + 𝑤^𝑘_0)<label>(1)</label></formula><p>where 𝑣_{𝑙,𝑚} is the m-th unit of the l-th input map 𝑉, ℎ^𝑘_𝑚 is the m-th unit of the k-th convolutional map 𝐻, 𝑁_𝑊 is the filter size, 𝑤^𝑘_{𝑙,𝑛} is the n-th filter weight, and 𝑤^𝑘_0 is the bias. In this procedure, features are detected locally and automatically through the weights shared across the feature map.</p><p>To reduce the resolution of the convolutional maps and the computational complexity, pooling of the convolutional maps is used. The pooling layer is usually built with a maximization or averaging function. Here, with 𝐺 as the size of the pooling window and the max function, an element of the pooling layer is defined as</p><formula>𝑝^𝑘_𝑚 = max_{1≤𝑔≤𝐺} ℎ^𝑘_{(𝑚−1)𝑠+𝑔}</formula><p>where 𝑠 is the stride with which the pooling window moves over the convolutional layer, and the other variables are defined above.</p><p>The experimental network shown in figure 2 contains 10 hidden layers. In the first hidden layer, the filter size is 8*3 with stride 1, and the pooling window is 4*4 with stride 1. After the first hidden layer, each layer is convolved with 8 filters of shape 8*3*8 and stride 1; the pooling windows are 4*4, and the ReLU activation function is used throughout the network. Finally, the feature map is flattened into a dense (fully connected) layer to train the classification model. L2 regularization is used to combat overfitting. 
Parameters such as the stride, the filter size in each layer, and the number of layers can be varied and should be selected according to the features of the signal. In this paper, networks with the configurations shown in table 1 are also studied. Rectangular filter windows are used because of the specific characteristics of the spectrograms. </p></div>
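As a minimal illustration of the shared-weight convolution of equation (1) followed by max pooling, the two operations can be written directly in NumPy. The array sizes below are small toy values chosen for readability, not the paper's 60*155 spectrograms or its 8*3 filters.

```python
import numpy as np

def conv1d_layer(V, W, b):
    """Eq. (1): unit m of map k is sigma(sum_l sum_n v[l, n+m-1] w[k, l, n] + b[k]).
    V: (L, M) input maps, W: (K, L, Nw) filters, b: (K,) biases."""
    L, M = V.shape
    K, _, Nw = W.shape
    out = np.empty((K, M - Nw + 1))
    for k in range(K):
        for m in range(M - Nw + 1):
            # the same filter W[k] is applied at every position m (shared weights)
            out[k, m] = np.sum(V[:, m:m + Nw] * W[k]) + b[k]
    return 1.0 / (1.0 + np.exp(-out))      # sigmoid activation

def max_pool(H, G, s):
    """Max pooling with window size G and stride s along the unit axis."""
    K, M = H.shape
    n = (M - G) // s + 1
    return np.stack([H[:, i*s : i*s + G].max(axis=1) for i in range(n)], axis=1)

rng = np.random.default_rng(0)
V = rng.standard_normal((2, 20))           # L=2 input maps of 20 units each
W = rng.standard_normal((3, 2, 5)) * 0.1   # K=3 filters of size Nw=5
H = conv1d_layer(V, W, np.zeros(3))
P = max_pool(H, G=4, s=2)
print(H.shape, P.shape)   # (3, 16) (3, 7)
```

Pooling shrinks each 16-unit convolutional map to 7 units here, which is exactly the resolution reduction the text motivates.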
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Pretraining</head><p>Deep learning is a "black box" that requires a large amount of data and processing to adjust the weights. Bayesian methods, in turn, are reliable and interpretable on small amounts of data, which is exactly what deep learning methods lack.</p><p>To combine the complementary advantages of these two methods, generative models have been developed that improve the effectiveness of deep learning on small datasets and eliminate overfitting problems. In deep learning structures, a region of the weight space is identified by a generative model, which helps the network converge quickly to a global minimum. The convolutional restricted Boltzmann machine (CRBM) is a typical generative model; it extends the RBM with image-like visible and hidden layers, which makes it suitable for the CNN setting. The model is trained to reach a state of thermal equilibrium, which is the deepest energy minimum. In this state, the hidden layers can model the structure of the input data.</p><p>The CRBM consists of two layers: the visible (input) layer 𝑉 and the hidden (convolutional) layer 𝐻. As in the CNN setting, the weights 𝑊^𝑘 between the input layer and the convolutional layer are shared by all elements of the hidden layer. The hidden units are binary, while the visible units can be real-valued or binary. Assume that the size of the visible layer is 𝑁_𝑉 and the size of the hidden layer is 𝑁_𝐻. There are 𝐾 filters, each filter 𝑊^𝑘 is convolved with the visible layer, and there is a bias 𝑏_𝑘 for each filter and a bias 𝑐 for the visible layer. The energy function with binary inputs is defined as</p><formula>𝐸(𝐯, 𝐡) = −∑_{𝑘=1}^{𝐾} ∑_{𝑗} ℎ^𝑘_𝑗 (𝑊̃^𝑘 ∗ 𝐯)_𝑗 − ∑_{𝑘=1}^{𝐾} 𝑏_𝑘 ∑_{𝑗} ℎ^𝑘_𝑗 − 𝑐 ∑_{𝑖} 𝑣_𝑖</formula><p>where 𝑊̃^𝑘 denotes the filter 𝑊^𝑘 flipped and ∗ denotes convolution.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>Proceedings of the 12th Majorov International Conference on Software Engineering and Computer Systems, December 10-11, 2020, Online, Saint Petersburg, Russia. 0000-0002-5060-1512 (M. Belenko); 0000-0002-4343-6408 (N. 
Burym); 0000-0003-1916-9546 (P. Balakshin)</figDesc></figure>
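The generative pretraining idea can be illustrated with a plain (non-convolutional) binary RBM trained by one-step contrastive divergence. This is a deliberately simplified stand-in for the CRBM with block Gibbs sampling described in the text, and every size, rate, and dataset below is an arbitrary toy value of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, rng, lr=0.1):
    """One contrastive-divergence (CD-1) update for a binary RBM.
    v0: (batch, Nv) data, W: (Nv, Nh), b: (Nh,) hidden bias, c: (Nv,) visible bias."""
    ph0 = sigmoid(v0 @ W + b)                       # hidden probabilities given data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample binary hidden states
    pv1 = sigmoid(h0 @ W.T + c)                     # reconstruct the visible layer
    ph1 = sigmoid(pv1 @ W + b)                      # hidden probabilities given recon
    # approximate likelihood gradient: data statistics minus model statistics
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    b += lr * (ph0 - ph1).mean(axis=0)
    c += lr * (v0 - pv1).mean(axis=0)
    return float(((v0 - pv1) ** 2).mean())          # reconstruction error

rng = np.random.default_rng(1)
data = (rng.random((64, 16)) < 0.3).astype(float)   # toy binary "input layer" data
W = rng.standard_normal((16, 8)) * 0.01
b, c = np.zeros(8), np.zeros(16)
errs = [cd1_step(data, W, b, c, rng) for _ in range(200)]
print(round(errs[0], 3), round(errs[-1], 3))
```

As training proceeds, the reconstruction error drops: the model has captured structure in the input, which is the state the paper then uses to initialize the CNN weights.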
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: System architecture.</figDesc><graphic coords="3,89.29,84.19,416.71,268.88" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Network architecture.</figDesc><graphic coords="5,93.64,84.19,408.00,348.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Configuration of the studied networks</figDesc><table><row><cell cols="2">Configuration Input layer</cell><cell>Hidden layers</cell></row><row><cell>Proposed</cell><cell>Convolutional: 8*3*1 Pooling: 4*4*1</cell><cell>Convolutional: 8*3*8 Pooling: 4*4*1</cell></row><row><cell>Big filters</cell><cell>Convolutional: 16*6*1 Pooling: 8*8*1</cell><cell>Convolutional: 16*3*16 Pooling: 8*8*1</cell></row><row><cell>Small filters</cell><cell>Convolutional: 4*2*1 Pooling: 2*2*1</cell><cell>Convolutional: 16*3*16 Pooling: 2*2*1</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head><p>Sensitivity shows the effectiveness of detecting abnormal voice files, and specificity shows the proportion of correctly detected healthy voice files. Precision (P) and the F1-score (F1) are also reported, where precision shows the proportion of files detected as pathological that truly are pathological.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><formula>𝑆𝑁 = 𝑇𝑃 / (𝑇𝑃 + 𝐹𝑁)<label>(6)</label></formula><p>True negative (TN) means that a healthy voice recording is correctly identified. True positive (TP) means that an abnormal voice recording is correctly identified. False negative (FN) means that an abnormal voice recording is detected incorrectly, and false positive (FP) means that a healthy voice recording is detected incorrectly.</p><p>There is also a difference in the performance of the CNN system with and without pretraining. When a CDBN is used to initialize the weights, the CNN setup becomes more reliable, with similar performance on the training set and the test set. This shows that the CDBN can avoid overfitting to some extent. However, the accuracy on the test dataset is lower when pretrained CDBN weights are used.</p><p>The CRBM is trained using block Gibbs sampling <ref type="bibr" target="#b9">[10]</ref>, an extension of Gibbs sampling in the RBM, to maximize the similarity between the distribution of the reconstructed visible layer and that of the input visible layer and thereby reach an equilibrium state. Stacked CRBMs make up the CDBN. After the first CRBM layer is trained, its activations are fed to the input of the next layer and its weights are "frozen"; the remaining layers are processed in the same way. Since the visible layer of the first layer works with real-valued data, Gaussian visible units are used in the first CRBM layer. After the weights of each layer are pretrained, backpropagation is applied to fine-tune them for a better classification result. Testing results are shown in tables 2 and 3. </p></div>			</div>
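For reference, all four measures follow directly from the confusion counts defined above. The counts used in the example call are hypothetical values for illustration, not results from the paper.

```python
def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, precision, and F1 from confusion counts."""
    sn = tp / (tp + fn)          # sensitivity (recall), Eq. (6)
    sp = tn / (tn + fp)          # specificity
    p = tp / (tp + fp)           # precision
    f1 = 2 * p * sn / (p + sn)   # F1: harmonic mean of precision and recall
    return sn, sp, p, f1

# hypothetical confusion counts, not the paper's results
sn, sp, p, f1 = metrics(tp=100, tn=110, fp=10, fn=20)
print(f"SN={sn:.3f} SP={sp:.3f} P={p:.3f} F1={f1:.3f}")
# SN=0.833 SP=0.917 P=0.909 F1=0.870
```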
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Occupational risks for voice problems</title>
		<author>
			<persName><forename type="first">K</forename><surname>Verdolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">O</forename><surname>Ramig</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Logopedics Phoniatrics Vocology</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="37" to="46" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Feature analysis for automatic detection of pathological speech</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">A</forename><surname>Dibazar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Narayanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">W</forename><surname>Berger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society</title>
				<meeting>the Second Joint 24th Annual Conference and the Annual Fall Meeting of the Biomedical Engineering Society</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="182" to="183" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">An optimum algorithm in pathological voice quality assessment using wavelet-packet-based features, linear discriminant analysis and support vector machine</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">K</forename><surname>Arjmandi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pooyan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biomedical Signal Processing and Control</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="3" to="19" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A new feature constituting approach to detection of vocal fold pathology</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hariharan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Polat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yaacob</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Systems Science</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1622" to="1634" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An Investigation of Multidimensional Voice Program Parameters in Three Different Databases for Voice Pathology Detection and Classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Al-Nasheri</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Voice</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">e18</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Voice pathology detection using interlaced derivative pattern on glottal source excitation</title>
		<author>
			<persName><forename type="first">G</forename><surname>Muhammad</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Biomedical Signal Processing and Control</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="156" to="164" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Voice Pathology Detection on the Saarbrücken Voice Database with Calibration and Fusion of Scores Using MultiFocal Toolkit</title>
		<author>
			<persName><forename type="first">D</forename><surname>Martínez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Lleida</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ortega</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Miguel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Villalba</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in Speech and Language Technologies for Iberian Languages: Iber-SPEECH 2012 Conference</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Torre Toledano</surname></persName>
		</editor>
		<meeting><address><addrLine>Madrid, Spain; Berlin, Heidelberg; Berlin Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2012">November 21-23, 2012</date>
			<biblScope unit="page" from="99" to="109" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Convolutional Neural Networks for Speech Recognition</title>
		<author>
			<persName><forename type="first">O</forename><surname>Abdel-Hamid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">R</forename><surname>Mohamed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Deng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Penn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">22</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page" from="1533" to="1545" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups</title>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page" from="82" to="97" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations</title>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Grosse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ranganath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">presented at the Proceedings of the 26th Annual International Conference on Machine Learning</title>
				<meeting><address><addrLine>Montreal, Quebec, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">A deep learning method for pathological voice detection using convolutional deep belief networks, Interspeech</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wu</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
