<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Monostatic Acoustic Localization Using Convolutional Neural Networks</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Stef</forename><surname>Brits</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Engineering</orgName>
								<orgName type="laboratory">CoSys-Lab</orgName>
								<orgName type="institution">University of Antwerp</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Robin</forename><surname>Kerstens</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Engineering</orgName>
								<orgName type="laboratory">CoSys-Lab</orgName>
								<orgName type="institution">University of Antwerp</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Flanders Make Strategic Research Centre</orgName>
								<address>
									<settlement>Lommel</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<author role="corresp">
							<persName><forename type="first">Jan</forename><surname>Steckel</surname></persName>
							<email>jan.steckel@uantwerpen.be</email>
							<affiliation key="aff0">
								<orgName type="department">Faculty of Applied Engineering</orgName>
								<orgName type="laboratory">CoSys-Lab</orgName>
								<orgName type="institution">University of Antwerp</orgName>
								<address>
									<settlement>Antwerp</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Flanders Make Strategic Research Centre</orgName>
								<address>
									<settlement>Lommel</settlement>
									<country key="BE">Belgium</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Monostatic Acoustic Localization Using Convolutional Neural Networks</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C05E403705DEABF3F8B358BF57632660</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:47+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Many applications require knowledge about the position of objects in a room. Popular ways to tackle this issue are to use either vision-based sensors, or several communicating beacons placed at known positions which allow beamforming or triangulation methods. However, in some cases vision is limited by a lack of light or the presence of airborne obscurants, and the placement of several beacons can be impractical. This paper proposes a method using a monostatic setup, in which a sensor uses a limited set of known Room Impulse Responses to accurately estimate its position in that environment using a regression Convolutional Neural Network. The research is performed using a finite-difference time-domain simulation method to generate realistic data and achieves an average estimation error of 14.7 cm.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>With an ever increasing demand for automation applications and technologies, localization is an issue that often needs to be handled. For outdoor situations this problem can easily be solved using (D)GPS, which is known to achieve accurate results <ref type="bibr" target="#b0">[1]</ref>. However, there are many cases in which a GPS system cannot be used for accurate measurements because there is no line of sight (LOS) between the satellites and the object. The obstacles between the object and the satellites disturb the communication, which prohibits reliable use. These drawbacks are further described by Gonzalo Seco-Granados et al. <ref type="bibr" target="#b1">[2]</ref>. For such cases, either indoors or in heavily obscured places (e.g. mining shafts, greenhouses, ...), other solutions are required that rely on more robust techniques utilizing little infrastructure.</p><p>Employing the information found in sound waves as the basis of a localization technique has some advantages over its alternatives, the most important being low cost and highly accurate indoor localization. These advantages are further established by Ureña et al. <ref type="bibr" target="#b2">[3]</ref>. Applying (ultra)sound as a localization medium can make it possible to attain an accuracy close to one centimeter, as also stated by Ureña. A great number of studies have researched acoustic localization and auralization. Dokmanic et al. <ref type="bibr" target="#b3">[4]</ref> research how acoustics can be used to estimate the shapes of rooms, which is relevant to indoor localization.</p><p>It is important to note that many approaches which use sound for localization purposes employ multiple microphones. Such an infrastructure is called a microphone array. 
These arrays collect data by measuring incoming sound waves synchronously, so that the signals can be compared to each other. Using the speed of sound and the way sound waves propagate, the differences between the times of arrival at the microphones make it possible to calculate the angle between the microphone array and the sound source <ref type="bibr" target="#b0">[1]</ref>. Another localization technique utilizing microphones uses synchronized static beacons. These beacons send out sound waves which are received by the object to be localized, which then triangulates its position relative to the static beacons. This can be accomplished in a multitude of ways, one of which is the time-of-flight (ToF) method: the time that the sound wave took to travel between the beacon and the object is used to find the location of the object. The concept of static beacons and the conventional localization methods is explained in more depth by <ref type="bibr" target="#b2">[3]</ref>.</p><p>Research in the field of sonar technology is of great interest when other forms of localization are not applicable, for example when visual localization is employed and there is not enough light for normal cameras, or there is a substantial amount of dust in the environment, as explored by Shehryar Khattak et al. <ref type="bibr" target="#b4">[5]</ref>. Our approach differs from most sonar systems in that those rely on static beacons or microphone arrays, whereas in our research no such infrastructure is provided and a monostatic setup is used without supplementary beacons. In this paper, we propose a machine learning approach using data obtained from finite-difference time-domain (FDTD) <ref type="bibr" target="#b5">[6]</ref> simulations. 
We will discuss the design and research choices for the data, the simulations, and lastly the convolutional neural network (CNN) constructed for localizing with a single transceiver. To the authors' knowledge, at the time of writing, the exact approach taken in this paper has not yet been published in the literature. The proposed method finds inspiration in popular Wi-Fi fingerprinting methods <ref type="bibr" target="#b6">[7]</ref> where pre-calculated radio maps are used to determine the location of a user. In <ref type="bibr" target="#b7">[8]</ref> a similar approach is used, but in a passive measuring scenario where environmental ultrasound is analyzed. Vera-Diaz <ref type="bibr" target="#b8">[9]</ref> takes another passive approach, tracking human speech using CNNs. The question this paper proposes to answer is: "Is it possible to localize an object inside a known room using only one sound transceiver? If it is possible, how accurate can the measurement be without the help of this additional infrastructure and which additional intelligent algorithms will be needed?".</p><p>In Section 2, we discuss the importance of the room impulse response for localization purposes in this paper. Section 3 contains information on the data generation employed for this research; we explain the methods of data generation and further discuss room modeling techniques. Section 4 explains the design choices of three different neural networks used to localize an object based on the simulated data. Section 5 shows the results of the networks localizing an object in a simulated room, with a greater focus on the convolutional neural network for regression. Lastly, we conclude this paper in Section 6, discussing the results, a mean localization error of 0.14 m, and the implications of the executed research.</p></div>
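The beacon-based time-of-flight idea mentioned above can be made concrete with a small sketch. Assuming three beacons at hypothetical known coordinates and measured one-way flight times, distances follow from the speed of sound and the position from a linearized least-squares trilateration; this code is not from the paper and only illustrates the infrastructure-heavy baseline that the monostatic approach avoids:

```python
import numpy as np

def tof_trilaterate(beacons, times, c=343.0):
    """Estimate a 2-D position from one-way time-of-flight measurements.

    beacons: (N, 2) known beacon coordinates [m]
    times:   (N,) one-way flight times [s]
    c:       speed of sound [m/s]
    """
    d = c * np.asarray(times)                  # beacon-to-object distances [m]
    b0, d0 = beacons[0], d[0]
    # Subtract the first range equation from the rest to linearize.
    A = 2.0 * (beacons[1:] - b0)
    rhs = (d0**2 - d[1:]**2
           + np.sum(beacons[1:]**2, axis=1) - np.sum(b0**2))
    pos, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pos

# Hypothetical setup: three beacons around a 10 m room, object at (3, 4).
beacons = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
true_pos = np.array([3.0, 4.0])
times = np.linalg.norm(beacons - true_pos, axis=1) / 343.0
print(tof_trilaterate(beacons, times))  # close to [3. 4.]
```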
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Defining the Room Impulse Response</head><p>For an object to localize itself inside a room using sound and without external help, prior knowledge about the room can be used to aid the process. To obtain this knowledge, this research employs a set of Room Impulse Responses (RIRs). The RIR can be described as the transfer function of a room between a transmitting sound source and a receiving microphone. The object can send out a broadband signal (e.g. a sine sweep, or an Additive White Gaussian Noise (AWGN) sequence) and record, for a specified amount of time, all reflections that originate from the surfaces in the room. This can be done by placing the transmitter and receiver on different sides of a room (bistatic), but also when they are located at the exact same position. When using this monostatic approach, the measurement forms a location-specific RIR whose content changes as the transceiver moves through the room. As every position has a unique set of distances to the reflecting surfaces, the RIR can also be expected to be unique. In our research, we create such RIRs using an FDTD simulation in MATLAB <ref type="bibr" target="#b5">[6]</ref>, with the transmitter and the receiver at the exact same location. For this work, an omnidirectional transceiver is assumed.</p><p>In the current research, we always use the same room within a single dataset, so the spatial properties of the room and the relative position between sender and receiver stay the same. A lot of research has already gone into accurately detecting acoustic reflections <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11]</ref>, from which it can be concluded that as the bandwidth and frequency of the measuring sequence increase, localization becomes more robust to noise and more accurate. 
However, because of the computational cost of the FDTD simulations, this first attempt uses a sequence that limits the time required for running the simulations: a pseudo-random AWGN sequence lasting 6 ms, with a bandwidth between 2 kHz and 4 kHz, sampled at 10 kHz.</p><p>The main challenge addressed in this paper consists of finding the connection between the location-specific RIR and that exact location in the room, using a limited set of prior information. Antonello et al. <ref type="bibr" target="#b11">[12]</ref> describe the importance of measuring and using the RIR in a setup where an infrastructure with multiple microphones is used. This was a recurring problem when studying the literature, because conventional approaches do not include single-transceiver or monostatic localization, as stated by El Badawy et al. <ref type="bibr" target="#b12">[13]</ref>.</p><p>One of the RIRs used for this research is depicted in Fig. <ref type="figure">1</ref>. A typical RIR consists of the measured pressure over time at a certain place in a room. In some cases, the pressure is expressed as energy in the air. Mathematically, the RIR represents the transfer function of the room between the sound source and the microphone, and because a monostatic setup is used, the RIR carries information about the location in the room. According to Cecchi et al. <ref type="bibr" target="#b13">[14]</ref>, the information provided by a RIR can be split into three sections:</p><p>1. Direct sound: the sound measured by the first LOS transmission. This can be helpful for evaluating the transmitted sound, as it is the direct, unreflected signal. In a real setup the source and receiver would be placed on the same object, a couple of centimeters apart, which means that the direct sound is measured almost instantaneously.</p><p>In simulation it is possible to measure and transmit at the exact same position. 
Thus, the direct sound is not relevant for our simulated results, as the transmitter and receiver are positioned on exactly the same pixel. (Figure <ref type="figure">1</ref>: graph of a typical room impulse response, discrete time on the x-axis, absolute pressure on the y-axis; measuring for a period of 0.3 seconds captures a large amount of the reverberation that occurs after the direct reflections have been recorded.) 2. Early reflections: the first reflected waves, created by first-order reflections. In a rectangular room, this part is likely to consist of six reflections originating from the walls, floor and ceiling. These reflections are influenced by the directivity of the microphone and sound source used, and by the distances of the walls relative to the object: a larger distance results in the sound waves taking longer to reach a wall and reflect back. 3. Late reflections: all measured information after the direct sound and early reflections. This part of the RIR contains a large amount of noise relative to the signal, since the measured signals have dissipated over time. Nevertheless, the late reflections contain a lot of information on the room's spatial properties. Through multi-path propagation, late reflections can reveal the dimensions of the room and can aid in aural localization in the room.</p></div>
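The RIR measurement idea above can be sketched in a toy example. We assume a synthetic three-tap RIR (direct sound plus two reflections) and recover it by cross-correlating the recording with the AWGN probe; for a cleaner toy result the probe here is longer than the paper's 6 ms sequence, and all values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 10_000                          # sampling rate [Hz], as in the paper
probe = rng.standard_normal(600)     # 60 ms AWGN probe (longer than the
                                     # paper's 6 ms, purely for clarity)

# Synthetic RIR: direct sound plus two attenuated early reflections.
rir = np.zeros(300)
rir[0], rir[40], rir[95] = 1.0, 0.5, 0.3

recording = np.convolve(probe, rir)  # what the transceiver would record

# For a white probe, cross-correlating the recording with the probe
# approximates the RIR (matched-filter idea), up to sidelobe noise.
full = np.correlate(recording, probe, mode="full")
est = full[len(probe) - 1:] / np.dot(probe, probe)

print(np.sort(np.argsort(est[:300])[-3:]))  # strongest three recovered taps
```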
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Generating Acoustic Data</head><p>Firstly, we tried the intuitive approach of measuring in real life, using a very basic procedure: measuring in a dorm room with a laptop (Lenovo Y520-15IKBN) that simultaneously sent and received a sub-20 kHz sine-sweep signal sampled at 44100 Hz, based on methods used by Stan et al. <ref type="bibr" target="#b14">[15]</ref>. This method proved to be inconsistent: when conducting multiple identical tests on different days, different measurements were found. This was thought to be caused by noise drowning out the information gathered from the measurements. Moreover, obtaining the large amount of training data required to train the network in this manner is cumbersome. For these reasons, the research was first validated using simulated measurements, which can be run in parallel to limit the required time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Simulation Methods</head><p>Simulation is a strong, well-known tool for creating substitute data. For example, Vargas et al. <ref type="bibr" target="#b15">[16]</ref> showed that simulated data can be used to train machine learning algorithms for sound recognition, and noted that transfer learning could be used to extend neural networks trained on simulated data so they can be tested on real, measured data. Such methods of simulation are explored more thoroughly by Markovic et al. <ref type="bibr" target="#b16">[17]</ref> and Deines et al. <ref type="bibr" target="#b17">[18]</ref>. Multiple ways of simulating the acoustic properties of a room exist. These can be split into two categories.</p><p>1. Solving wave equations: this approach numerically solves the wave equations to find the physical properties of a room. It is more accurate than geometrical acoustics, at the drawback of a large computational cost. 2. Geometrical acoustics (GA): this approach simplifies the acoustic modeling problem by assuming sound waves to be rays. This simplification gives a favorably lower computational cost at the price of accuracy. Savioja et al. <ref type="bibr" target="#b18">[19]</ref> further expand upon these concepts in practice. Maa et al. <ref type="bibr" target="#b19">[20]</ref> also provide useful insights, favouring geometrical acoustics for practitioners, while favouring wave equations for theoretical studies.</p><p>Both wave-equation solving and GA contain useful modeling techniques for localization and auralization purposes. The choice was made to use a wave-equation solving technique due to its accurate nature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Finite-Difference Time-Domain Simulation</head><p>We used the finite-difference time-domain method for modeling the room and, more precisely, its reflective properties. This choice was based on multiple successful studies comparing modeling methods and preferring FDTD over GA methods such as ray-tracing. De Sena et al. <ref type="bibr" target="#b20">[21]</ref> and Yokota et al. <ref type="bibr" target="#b21">[22]</ref> used FDTD in comparative studies, where they show the relevance of the numerical approach to localization problems in acoustics. Our research could also be executed with other simulation methods; for example, with ray-tracing methods a higher-frequency signal could have been simulated, as explored by Vasiou et al. <ref type="bibr" target="#b22">[23]</ref>.</p><p>For this research, it is more important to know what FDTD does than how the calculations are performed. For a detailed mathematical definition of FDTD, we refer to <ref type="bibr" target="#b23">[24]</ref> by Schneider, where chapter twelve discusses acoustics separately. FDTD calculates the next state of the pressure field based on the previous states, which yields a result that is close to what would be measured in practice. Using this algorithm we defined a room in MATLAB and calculated the pressure (sound) fields over a time of 0.3 seconds. This duration is sufficiently long, as 0.3 seconds of measurements covers propagation paths over 100 meters long inside a room of 10 by 10 meters. A snapshot of the FDTD simulation is depicted in Fig. 
<ref type="figure">2</ref>, where the transceiver is located at the center of the visible wavefront and the room is shown as the borders of the plot, with a small number of reflectors located in the left portion of the room, breaking its symmetry. The figure shows 170 by 170 nodes, as opposed to the expected 10 meters by 10 meters, because the FDTD simulation requires space to be discretized in order to calculate the sound waves in the room. The coordinates of the object in the room were chosen as random values in the xy-plane. Every simulation consisted of 0.3 seconds of sampling at a random location, with the signal source at the same location, simulating the transceiver.</p></div>
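As a rough illustration of what the FDTD scheme does (not the MATLAB implementation used in the paper), the sketch below advances a discretized 2-D pressure field with a leapfrog update and records the pressure at the source node, i.e. a monostatic RIR; the grid size, boundary treatment and impulsive source are simplified assumptions:

```python
import numpy as np

c, dx = 343.0, 0.1                  # speed of sound [m/s], grid spacing [m]
dt = dx / (c * np.sqrt(2))          # time step at the 2-D CFL stability limit
n = 100                             # 10 m x 10 m room -> 100 x 100 nodes
lam = (c * dt / dx) ** 2            # Courant number squared

p_prev = np.zeros((n, n))
p = np.zeros((n, n))
p[50, 50] = 1.0                     # impulse at the transceiver node

rir = []
for _ in range(200):
    # Discrete Laplacian of the pressure field (5-point stencil).
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0)
           + np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4.0 * p)
    p_next = 2.0 * p - p_prev + lam * lap
    # Pressure-release walls: clamp the boundary to zero each step.
    p_next[0, :] = p_next[-1, :] = 0.0
    p_next[:, 0] = p_next[:, -1] = 0.0
    p_prev, p = p, p_next
    rir.append(p[50, 50])           # monostatic: record at the source node

rir = np.array(rir)
print(rir.shape)
```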
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Using Neural Networks for Localization</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Fully Connected Neural Networks</head><p>The first simulation results consisted of 3000 samples per simulation, measuring the pressure at the location of the object, which is treated as the room impulse response (RIR). Along with the RIR, the ground truth location was stored for every simulation, to later use as a label in the neural networks. The location contains the x- and y-coordinate in the simulated room. 250 simulations were constructed based on the same room with the same additive white Gaussian noise pulse. These simulations formed the first real dataset that could be used as input for the fully connected neural network. It was important to use a simple network at first, for clarity and to establish that locations can indeed be extracted from the simulated data.</p><p>The main goal in this step was not to produce a precise network, but an accurate one: the importance lies in the consistency of the localization guesses, not in their precision. For this we designed a simple, fully connected classification network that achieved a high accuracy despite its small dataset of 500 simulations, as seen in the confusion matrix in Fig. <ref type="figure" target="#fig_4">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Classification Convolutional Neural Networks</head><p>After designing the first fully connected network, we observed that training a fully connected network on the time-series data would be less efficient than employing the frequency-domain counterpart of the data. To achieve this, the fast Fourier transform (FFT) was used to turn the time-series RIR into images (spectrograms), as was also done in <ref type="bibr" target="#b7">[8]</ref>. The dimensions of those spectrogram images depend on the bin size used to compute them, where a higher resolution contains more information at the cost of a higher computational load. In the end, the original matrix was a complex 257 by 208 matrix, which was split into two channels: the first containing the amplitude and the second containing the phase. By using the spectrogram, the amplitude and phase could both be used as input features. A downside of using this type of network is that the localization output is limited to a fixed set of outcomes, which drastically limits the potential accuracy of the system.</p></div>
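The spectrogram construction described above can be sketched as follows: a minimal hand-rolled STFT that stacks amplitude and phase into a two-channel image. The window and hop sizes are illustrative assumptions and do not reproduce the paper's 257 by 208 shape:

```python
import numpy as np

def rir_to_channels(rir, nperseg=64, hop=32):
    """Turn a time-domain RIR into a 2-channel image: |STFT| and phase."""
    window = np.hanning(nperseg)
    frames = []
    for start in range(0, len(rir) - nperseg + 1, hop):
        seg = rir[start:start + nperseg] * window  # windowed segment
        frames.append(np.fft.rfft(seg))            # one-sided spectrum
    spec = np.array(frames).T                      # (freq_bins, time_frames)
    # Channel 0: amplitude, channel 1: phase, as in the paper's two channels.
    return np.stack([np.abs(spec), np.angle(spec)])

# A 3000-sample RIR, as in the simulated dataset (content here is random).
rir = np.random.default_rng(1).standard_normal(3000)
channels = rir_to_channels(rir)
print(channels.shape)  # (2, 33, 92)
```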
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Regression Convolutional Neural Networks</head><p>The final iteration of the algorithm for monostatic localization consisted of a convolutional network with three convolution layers, each followed by a ReLU and a normalization layer. The structure of this network can be seen in Fig. <ref type="figure" target="#fig_2">5</ref>. The input consists of the same type of spectrogram used in the classification network. The benefit of using a regression network is that the output is not limited to a fixed set of predetermined outcomes, but is a continuous coordinate estimate.</p></div>
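A toy forward pass through the kind of blocks described above (convolution, ReLU, normalization, followed by a regression head) might look as follows; the channel counts, kernel sizes and random weights are invented for illustration and are not the paper's hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k):
    """Valid 2-D convolution of a (C, H, W) input with (O, C, kh, kw) kernels."""
    O, C, kh, kw = k.shape
    _, H, W = x.shape
    out = np.zeros((O, H - kh + 1, W - kw + 1))
    for o in range(O):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k[o])
    return out

def block(x, k):
    """One block: convolution -> ReLU -> per-channel normalization."""
    x = np.maximum(conv2d(x, k), 0.0)
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True) + 1e-8
    return (x - mu) / sd

x = rng.standard_normal((2, 33, 40))          # 2-channel spectrogram input
for k_shape in [(4, 2, 3, 3), (8, 4, 3, 3), (8, 8, 3, 3)]:
    x = block(x, rng.standard_normal(k_shape) * 0.1)

w = rng.standard_normal((2, x.size)) * 0.01   # fully connected regression head
coords = w @ x.ravel()                        # two continuous outputs: (x, y)
print(coords.shape)  # (2,)
```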
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Overfitting</head><p>Overfitting was a recurring obstacle during this research. It is well known that overfitting is bound to be a problem in any research involving (convolutional) neural networks, and its frequent occurrence has led to a large number of methods to counteract it; as an example, Srivastava et al. <ref type="bibr" target="#b24">[25]</ref> use dropout layers in deep neural networks. For this research, several methods were used to counteract overfitting. Firstly, we used lower initial learning rates to stop the weights from reaching their final values too fast; if the learning rate is too high, the network quickly fits overly specific features of the training data and overfits. Additionally, we used larger datasets, containing 3000 simulations, which helps restrain overfitting: if a network sees more (diverse) data during training, it is less likely to learn trivial, wrong features. Finally, the switch from classification to regression CNNs delayed the onset of overfitting to a later stage of training, since a binary output of 'left' and 'right' makes no distinction between 'far left' and 'close left', while regression generates an exact coordinate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5.">Designing the Layers and Hyperparameters</head><p>When designing layers for a neural network, two general starting points can be chosen. The first option is to make a minimalist network with as little complexity as possible and slowly add complexity until the desired specifications are reached. The other option is the exact opposite: starting with an especially complex network and whittling down the complexity until reaching the desired results.</p><p>In this research, we employed a combination of these two approaches. The model on which the first CNN was based was a classification network for lung sound analysis available within our lab. This was a more complex network than what was needed for this application, but it laid the groundwork for using spectrogram images as input data. This corresponds to the second starting point: a complex neural network that can be whittled down into a usable network. We lowered the number of weights and biases by changing the kernel and stride sizes. The art of designing layers for neural networks relies on trial and error, as explored by Suganuma et al. <ref type="bibr" target="#b25">[26]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results and Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Fully Connected Neural Network Results</head><p>After making a small dataset of 100 RIRs, each consisting of 3000 samples, the first fully connected network, using time-domain data as input, could be trained and tested. The earliest results were obtained by testing the network on the limited dataset and plotting the confusion matrix, illustrated in Fig. <ref type="figure" target="#fig_4">6</ref>.</p><p>As seen in Fig. <ref type="figure" target="#fig_1">4a</ref>, the output consists of guessing whether the object is LEFT or RIGHT in the room, corresponding to the left-hand and right-hand side of the simulated room; the room is split into left and right halves at the middle x-coordinate. Fig. <ref type="figure" target="#fig_4">6</ref> shows the confusion matrix of that network, with zero corresponding to LEFT and one corresponding to RIGHT. The network is capable of estimating the rough location of the transceiver with an accuracy of 81%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Classification With Convolutional Neural Network</head><p>The results from the first versions of the convolutional network were not optimized to a useful degree. The added complexity and the small dataset made classification harder: the network was too complex for the amount of data available, and overfitting was so substantial that learning would quickly halt. This did not mean that building this network was in vain; its goal was to give the regression network firm foundations. We learned the importance of the relation between the complexity of the network and the amount of available data, as well as ways of lowering learning rates to minimize overfitting. This network served as a stepping stone to the next results, as it validated the possibility of using spectrogram images of the RIR data to perform rough localization. This paved the way for the third version of the network, which adds regression to produce more accurate estimations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Regression With Convolutional Neural Network</head><p>The benefit of using regression instead of classification is that regression trains on continuous variables instead of discrete labels, which makes it more suitable for applications such as this one, where two parameters need to be estimated accurately. Fig. <ref type="figure">8</ref> shows the distribution of the estimation error between the true locations and the location estimates produced by the most accurate network we could design and train during this research. Note that the error function is the Euclidean distance between the two points. The simulated room in which these predictions took place is visible in Fig. <ref type="figure">2</ref>. The data is extracted from 300 position estimates and shows that the majority of estimations have an estimation error below 20 cm, with an overall average of 14.7 cm. Fig. <ref type="figure">7</ref> depicts a single example of a set of estimated coordinates, generated using the network shown in Fig. <ref type="figure" target="#fig_2">5</ref>; the room is displayed within the boundaries of the plot. The results are promising and encourage future research on this topic. The height dimension could be added in a later iteration of this research, but was not deemed relevant for the current application.</p><p>To contribute to the robustness of the research, additional tests were performed on the trained networks used for the last results. The goal was to show that the localization is not random and is far more precise and accurate than random guessing. When using the same 300 simulations used in Fig. <ref type="figure">8</ref>, random guesses resulted in a mean error distance of 4.0961 m. This confirms that the discussed results, with a mean error of 0.1473 m in the same environment, are far more accurate.</p></div>
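The sanity check against random guessing can be reproduced in spirit with synthetic numbers; the positions and the roughly 15 cm error level below are assumptions for illustration, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 300
truth = rng.uniform(0.0, 10.0, size=(n, 2))   # synthetic ground-truth positions

# Hypothetical network output: truth plus roughly 15 cm of Gaussian error.
estimates = truth + rng.normal(scale=0.15 / np.sqrt(2), size=(n, 2))
# Baseline: uniformly random guesses inside the same 10 m x 10 m room.
guesses = rng.uniform(0.0, 10.0, size=(n, 2))

def mean_err(a, b):
    """Mean Euclidean distance between paired 2-D points."""
    return float(np.linalg.norm(a - b, axis=1).mean())

print(f"network: {mean_err(estimates, truth):.2f} m, "
      f"random: {mean_err(guesses, truth):.2f} m")
```

As in the paper, the random baseline lands at a mean error of several meters while the simulated "network" stays well under half a meter, showing the comparison is meaningful.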
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Discussion</head><p>The research question addressed in this paper was the following: "Is it possible to localize an object inside a known room using only one sound transceiver? If it is possible, how accurate can the measurement be without the help of this additional infrastructure and which additional intelligent algorithms will be needed?" This paper comes to the conclusion that it is indeed possible to accurately localize an object in a room simulated using finite-difference time-domain numerical techniques. Different types of networks were tested, starting with a classification approach where rough estimates of the position of the transceiver were made based on time-domain recordings. To improve accuracy and extract more information from the time-domain data, the research switched to working with spectrogram images of the recorded data, which made it possible to use convolutional neural networks. To reach a mean accuracy of under 15 cm, a regression convolutional neural network was needed. This network was trained on more than 2000 different spectrograms of room impulse responses.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Future Work</head><p>This work was constrained to using only one microphone/transceiver, which made the research interesting and different from conventional auralization and localization research. Future research may consist of transferring the simulated networks to real-life scenarios. Bianco et al. <ref type="bibr" target="#b26">[27]</ref> suggest the use of transfer learning and summarize multiple successful studies that reach accurate localization in real-world scenarios by employing transfer learning. Due to limitations in time and computational power, the research in this paper was forced to use a sub-optimal measuring sequence for this type of application. Future research may therefore use a measuring sequence with a larger bandwidth and explore the use of coded emissions, which would allow multiple objects to be tracked simultaneously. The influence of the transceiver beam pattern should also be examined. We hope that future research may build upon the concepts and approaches formed in this study.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2, Figure 3:</head><label>2, 3</label><figDesc>Figure 2: A 2-dimensional projection of an example 3-D FDTD simulation. The wavefront originates from the location of the acoustic transceiver, where also the RIR will be recorded after the signal has passed through the environment.</figDesc><graphic coords="5,193.47,502.10,166.68,125.05" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Illustrations of the three different neural networks produced for this paper. Notice that the (a) utilizes time-domain data, while the other networks use both the time and frequency information present in the spectrogram images.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: The structure of the regression Convolutional Neural Network used to obtain the final results in this paper.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Illustration of the confusion matrix of the first fully connected network, which made rough estimations on the transceiver being on either the left side or the right side of the room, with zero corresponding to 'left' and one corresponding to 'right'. The obtained accuracy is 81%.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head></head><label></label><figDesc>CNN, green is the location, red is the prediction</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 7 :Figure 8 :</head><label>78</label><figDesc>Figure 7: Example of a set of coordinates, estimated by the regression CNN when using the spectrogram or the measured RIR as input. Ground truth coordinate: [6,9 m; 8,4 m], estimated coordinate: [7,0183 m; 8,5744 m]</figDesc><graphic coords="11,193.47,334.72,166.68,125.05" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Siciliano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Khatib</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-540-30301-5</idno>
		<ptr target="http://link.springer.com/10.1007/978-3-540-30301-5" />
		<title level="m">Springer Handbook of Robotics</title>
				<meeting><address><addrLine>Berlin Heidelberg; Berlin, Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Challenges in indoor global navigation satellite systems: Unveiling its core features in signal processing</title>
		<author>
			<persName><forename type="first">G</forename><surname>Seco-Granados</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>López-Salcedo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jiménez-Baños</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>López-Risueño</surname></persName>
		</author>
		<idno type="DOI">10.1109/MSP.2011.943410</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Signal Processing Magazine</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="108" to="131" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Acoustic local positioning with encoded emission beacons</title>
		<author>
			<persName><forename type="first">J</forename><surname>Urena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Hernandez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>García</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Villadangos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Gualda</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><forename type="middle">J</forename><surname>Álvarez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Aguilera</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the IEEE</title>
		<imprint>
			<biblScope unit="volume">106</biblScope>
			<biblScope unit="page" from="1042" to="1062" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Acoustic echoes reveal room shape</title>
		<author>
			<persName><forename type="first">I</forename><surname>Dokmanić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Parhizkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Walther</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">M</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Vetterli</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the National Academy of Sciences</title>
		<imprint>
			<biblScope unit="volume">110</biblScope>
			<biblScope unit="page" from="12186" to="12191" />
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<title level="m" type="main">Visual-thermal landmarks and inertial fusion for navigation in degraded visual environments</title>
		<author>
			<persName><forename type="first">S</forename><surname>Khattak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Papachristos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Alexis</surname></persName>
		</author>
		<idno>CoRR abs/1903.01656</idno>
		<ptr target="http://arxiv.org/abs/1903.01656" />
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Acoustic Absorbers and Diffusers</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">J</forename><surname>Cox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>D&apos;Antonio</surname></persName>
		</author>
		<idno type="DOI">10.4324/9781482266412</idno>
		<ptr target="https://doi.org/10.4324/9781482266412" />
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="volume">4</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">WiFi fingerprint indoor positioning system using probability distribution comparison</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">Le</forename><surname>Dortz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Gain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Zetterberg</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICASSP.2012.6288374</idno>
	</analytic>
	<monogr>
		<title level="m">ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing -Proceedings</title>
				<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="2301" to="2304" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Indoor localization based on analysis of environmental ultrasound</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Nagama</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Umezawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Osawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IPIN</title>
		<title level="s">Short Papers/Work-in-Progress Papers</title>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="423" to="430" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">M</forename><surname>Vera-Diaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Pizarro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Macias-Guarasa</surname></persName>
		</author>
		<idno type="DOI">10.3390/s18103418</idno>
		<ptr target="https://pubmed.ncbi.nlm.nih.gov/30322007/" />
	</analytic>
	<monogr>
		<title level="j">Sensors</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Optimization of the receive filter and transmit sequence for active sensing</title>
		<author>
			<persName><forename type="first">P</forename><surname>Stoica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<idno type="DOI">10.1109/TSP.2011.2179652</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Signal Processing</title>
		<imprint>
			<biblScope unit="volume">60</biblScope>
			<biblScope unit="page" from="1730" to="1740" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Stoica</surname></persName>
		</author>
		<title level="m">Waveform design for active sensing systems: a computational approach</title>
				<imprint>
			<publisher>Cambridge University Press</publisher>
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Room impulse response interpolation using a sparse spatio-temporal representation of the sound field</title>
		<author>
			<persName><forename type="first">N</forename><surname>Antonello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">De</forename><surname>Sena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">A</forename><surname>Naylor</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Van Waterschoot</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2017.2730284</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="page" from="1929" to="1941" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Direction of arrival with one microphone, a few legos, and nonnegative matrix factorization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">El</forename><surname>Badawy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Dokmanić</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2018.2867081</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">26</biblScope>
			<biblScope unit="page" from="2436" to="2446" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Room response equalization-a review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cecchi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Carini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Spors</surname></persName>
		</author>
		<idno type="DOI">10.3390/app8010016</idno>
		<ptr target="https://doi.org/10.3390/app8010016" />
	</analytic>
	<monogr>
		<title level="j">Applied Sciences</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page">16</biblScope>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Comparison of different impulse response measurement techniques</title>
		<author>
			<persName><forename type="first">G.-B</forename><surname>Stan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J.-J</forename><surname>Embrechts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Archambeau</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of the Audio Engineering Society</title>
		<imprint>
			<biblScope unit="volume">50</biblScope>
			<biblScope unit="page" from="249" to="262" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
	<note>AES</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">On improved training of cnn for acoustic source localisation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Vargas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Hopgood</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Subr</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2021.3049337</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="720" to="732" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Three-dimensional point-cloud room model in room acoustics simulations</title>
		<author>
			<persName><forename type="first">M</forename><surname>Markovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Olesen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hammershøi</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.4806371</idno>
		<ptr target="https://sfx.aub.aau.dk/sfxaub?sid=pureportal&amp;doi=10.1121/1.4806371" />
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">133</biblScope>
			<biblScope unit="page" from="3532" to="3532" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Comparative visualization for wave-based and geometric acoustics</title>
		<author>
			<persName><forename type="first">E</forename><surname>Deines</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hering-Bertram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mohring</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Jegorovs</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Oberste-Dommes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nielson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Visualization and Computer Graphics</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="1173" to="1180" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
	<note>IEEE Transactions on</note>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Overview of geometrical room acoustic modeling techniques</title>
		<author>
			<persName><forename type="first">L</forename><surname>Savioja</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><forename type="middle">P</forename><surname>Svensson</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.4926438</idno>
		<ptr target="https://doi.org/10.1121/1.4926438" />
	</analytic>
	<monogr>
		<title level="j">The Journal of the Acoustical Society of America</title>
		<imprint>
			<biblScope unit="volume">138</biblScope>
			<biblScope unit="page" from="708" to="730" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">The flutter echoes</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Maa</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.1916161</idno>
		<ptr target="https://doi.org/10.1121/1.1916161" />
	</analytic>
	<monogr>
		<title level="j">The Journal of the Acoustical Society of America</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="170" to="178" />
			<date type="published" when="1941">1941</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">On the modeling of rectangular geometries in room acoustic simulations</title>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">De</forename><surname>Sena</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Antonello</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Moonen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Van Waterschoot</surname></persName>
		</author>
		<idno type="DOI">10.1109/TASLP.2015.2405476</idno>
	</analytic>
	<monogr>
		<title level="j">IEEE/ACM Transactions on Audio, Speech, and Language Processing</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="774" to="786" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">Comparison of room impulse response calculated by the simulation methods based on geometrical acoustics and wave acoustics</title>
		<author>
			<persName><forename type="first">T</forename><surname>Yokota</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sakamoto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Tachibana</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="2715" to="2716" />
		</imprint>
		<respStmt>
			<orgName>Institute of Industrial Science, University of Tokyo</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">A detailed study of ray tracing performance: render time and energy cost</title>
		<author>
			<persName><forename type="first">E</forename><surname>Vasiou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Shkurko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Mallett</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brunvand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yuksel</surname></persName>
		</author>
		<idno type="DOI">10.1007/s00371-018-1532-8</idno>
	</analytic>
	<monogr>
		<title level="j">The Visual Computer</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<title level="m" type="main">Understanding the Finite-Difference Time-Domain Method</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">B</forename><surname>Schneider</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
			<pubPlace>Washington</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Dropout: A simple way to prevent neural networks from overfitting</title>
		<author>
			<persName><forename type="first">N</forename><surname>Srivastava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Hinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Krizhevsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Salakhutdinov</surname></persName>
		</author>
		<ptr target="http://jmlr.org/papers/v15/srivastava14a.html" />
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1929" to="1958" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A genetic programming approach to designing convolutional neural network architectures</title>
		<author>
			<persName><forename type="first">M</forename><surname>Suganuma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shirakawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nagao</surname></persName>
		</author>
		<idno type="DOI">10.1145/3071178.3071229</idno>
		<ptr target="https://doi.org/10.1145/3071178.3071229" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Genetic and Evolutionary Computation Conference, GECCO &apos;17</title>
				<meeting>the Genetic and Evolutionary Computation Conference, GECCO &apos;17<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="497" to="504" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Machine learning in acoustics: Theory and applications</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Bianco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gerstoft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Traer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ozanich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Roch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gannot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-A</forename><surname>Deledalle</surname></persName>
		</author>
		<idno type="DOI">10.1121/1.5133944</idno>
		<ptr target="https://doi.org/10.1121/1.5133944" />
	</analytic>
	<monogr>
		<title level="j">The Journal of the Acoustical Society of America</title>
		<imprint>
			<biblScope unit="volume">146</biblScope>
			<biblScope unit="page" from="3590" to="3628" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
