<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stef Brits</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robin Kerstens</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jan Steckel</string-name>
          <email>jan.steckel@uantwerpen.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoSys-Lab, Faculty of Applied Engineering, University of Antwerp</institution>
          ,
          <addr-line>Antwerp</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Flanders Make Strategic Research Centre</institution>
          ,
          <addr-line>Lommel</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Many applications require knowledge about the position of objects in a room. Popular ways to tackle this issue are to use either vision-based sensors or several communicating beacons placed at known positions, which allow beamforming or triangulation methods. However, in some cases vision is limited due to a lack of light or the presence of airborne obscurants, and the placement of several beacons can be impractical. This paper proposes a method using a monostatic setup in which a sensor uses a limited set of known Room Impulse Responses to accurately estimate its position in that environment using a Regression Convolutional Neural Network. The research is performed using a Finite-difference Time-domain simulation method to generate realistic data and achieves results with an average estimation error of 14.7 cm.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With an ever-increasing demand for automation applications and technologies, localization
is an issue that often needs to be handled. For outdoor situations this problem can easily be
solved using (D)GPS, which is known to achieve accurate results [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, there are many
cases in which a GPS system cannot be used for accurate measurements because the line of
sight (LOS) between the satellites and the object is lacking. The obstacles between the object and
the satellites cause disturbances in the communication, which prohibit reliable use. These
drawbacks are further described by Gonzalo Seco-Granados et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. For cases like these, which
are either indoors or located in heavily obscured places (e.g. mining shafts, greenhouses, ...),
other solutions are required that rely on more robust techniques utilizing little infrastructure.
      </p>
      <p>
        Employing the information found in sound waves as the basis of a localization technique
offers some advantages over its alternatives, the most important being low cost
and highly accurate indoor localization. These advantages are further established by Ureña
et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Applying (ultra)sound as a localization medium can make it possible to attain an
accuracy close to one centimeter, as also stated by Ureña. There have been a great number of
studies that research acoustic localization and auralization. Dokmanic et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] research how
acoustics can be used to estimate the shapes of rooms, which can be practical when researching
indoor localization.
CEUR Workshop Proceedings © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
It is important to note that in many cases which implement sound for localization purposes,
multiple microphones are employed. Such an infrastructure is called a microphone array.
These arrays collect data by measuring incoming sound waves in a synchronous manner which
can then be compared to each other. Using the concepts of the speed of sound and the way
sound waves propagate, the differences between the times of arrival at the microphones make
it possible to calculate the angle between the microphone array and the sound source [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Another localization technique utilizing microphones uses synchronized static beacons. These
beacons can send out sound waves which are received by the object to localize. This object then
triangulates its position relative to the static beacons. This localization can be accomplished in
a multitude of ways, one of which is the time-of-flight (ToF) method. When ToF is used, the
time that the sound wave took to travel between the beacon and the object is used to find the
location of the object. The concept of static beacons and the conventional localization methods
is explained more in depth by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Research in the field of sonar technology is of great interest when other forms of localization
are not applicable to a situation. For example, when visual localization is employed and
there is not enough light for normal cameras or there is a substantial amount of dust in the
environment, as explored by Shehryar Khattak et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our approach differs from
most sonar systems, which use static beacons or microphone arrays. In
our research, no such infrastructure is provided and a monostatic setup is used without the
use of supplementary beacons. In this paper, we will propose a machine learning approach
using data obtained from finite-difference time-domain (FDTD) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] simulations. We will
discuss the design and research choices for the data, simulations, and lastly, the convolutional
neural network (CNN) constructed for localizing with a single transceiver. To the authors’
knowledge, at the time of writing, the exact approach taken in this paper has not yet been published
in the literature. The proposed method finds inspiration in popular Wi-Fi fingerprinting methods
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] where pre-calculated radio maps are used to determine the location of a user. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] a similar
approach is used, but with a passive measuring scenario where environmental ultrasound is
being analyzed. Vera-Diaz [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] takes another passive approach, tracking human speech using
CNNs. The question this paper proposes to answer is: “Is it possible to localize an object
inside a known room using only one sound transceiver? If it is possible, how accurate can
the measurement be without the help of this additional infrastructure and which additional
intelligent algorithms will be needed?”.
      </p>
      <p>In section 2, we will discuss the importance of a room impulse response for localization
purposes in this paper. Thereafter, Section 3 contains information on the data generation
employed for this research. We explain the methods of data generation and further discuss
room modeling techniques. Section 4 explains the design choices of three different neural
networks used to localize an object based on the simulated data. Additionally, Section 5 shows
the results of the networks, localizing an object in a simulated room, with a greater focus on
the convolutional neural network for regression. Lastly, we conclude this paper in Section 6, discussing
the results, a mean localization error of 0.14 m, and the implications of the executed research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Defining the Room Impulse Response</title>
      <p>
        For an object to localize itself inside a room without external help using sound, prior knowledge
about the room can be used to help the process. To obtain this knowledge, this research aims
to employ a set of Room Impulse Responses (RIR). The RIR can be described as the transfer
function of a room between a transmitting sound source and a receiving microphone. The
object can send out a broadband signal (e.g. a sine sweep, or an Additive White Gaussian Noise
(AWGN) sequence) and record, for a specified amount of time, all reflections that originate from
the available surfaces in the room. This can be done by placing the transmitter and receiver on
different sides of a room (bistatic) but also when they are located at the same exact position.
When using this monostatic approach, the measurement forms a location-specific RIR whose
content changes as the transceiver moves through the room. As every independent position
has a unique set of distances towards the reflecting surfaces, the RIR can also be expected to be
unique. In our research, we will create such RIRs using an FDTD simulation in MATLAB
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], using the same exact location for both the transmitter and the receiver. For this work, an
omnidirectional transceiver is assumed.
      </p>
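      <p>As a toy illustration of how a location-specific RIR can be recovered from a broadband AWGN probe (a Python sketch under simplified assumptions, not the paper's MATLAB pipeline): because a white probe has a near-delta autocorrelation, cross-correlating the recording with the probe approximates the RIR.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "room": a sparse impulse response with three echoes (illustrative).
rir_true = np.zeros(400)
for delay, gain in [(30, 1.0), (120, 0.5), (250, 0.25)]:
    rir_true[delay] = gain

# A white-noise probe; longer than the paper's 6 ms sequence so that its
# autocorrelation is closer to a delta function.
probe = rng.standard_normal(500)
received = np.convolve(probe, rir_true)   # what the microphone records

# Cross-correlate the recording with the probe and keep non-negative lags.
rir_est = np.correlate(received, probe, mode="full")[len(probe) - 1:]
rir_est /= np.dot(probe, probe)           # normalise by probe energy

print(np.argmax(rir_est))                 # strongest echo at sample 30
```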
      <p>
        In the current research, we always use the same room in a single dataset. With this knowledge
we can state that spatial properties of the room and the relative position between sender and
receiver stay the same. A lot of research has already gone into accurately detecting acoustic
reflections [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], and it can be concluded that as the bandwidth of the measuring sequence
increases, localization becomes more robust to noise, and higher frequencies allow for more
accurate results. However, because of the computational needs of the FDTD simulations,
this first attempt uses a sequence that limits the time required for running the simulations: a
pseudo-random AWGN sequence that lasts 6 ms, has a bandwidth between 2 kHz and 4
kHz, and is sampled at 10 kHz.
      </p>
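      <p>A probe of this kind can be generated by band-pass filtering pseudo-random noise. The sketch below (Python rather than the paper's MATLAB; the filter design is an assumption) produces a 6 ms, 2–4 kHz sequence sampled at 10 kHz:</p>

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 10_000                     # 10 kHz sampling rate, as in the paper
dur = 0.006                     # 6 ms probe -> 60 samples
rng = np.random.default_rng(42)

noise = rng.standard_normal(int(fs * dur))
# 2-4 kHz band-pass; 4 kHz sits below the 5 kHz Nyquist limit. The
# 4th-order Butterworth design is an assumption, not from the paper.
sos = butter(4, [2000, 4000], btype="bandpass", fs=fs, output="sos")
probe = sosfiltfilt(sos, noise)
probe /= np.max(np.abs(probe))  # normalise to unit peak amplitude
print(probe.shape)              # (60,)
```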
      <p>
        The main challenge proposed in this paper consists of finding the connection between the
location-specific RIR and that same exact location in that room using a limited set of prior
info. Antonello et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] describe the importance of measuring and using the RIR where an
infrastructure with multiple microphones is used. This was a recurring problem when studying
the literature, because conventional approaches do not include single-transceiver localization or
monostatic localization, as stated by El Badawy et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        One of the RIRs used for this research is depicted in Fig. 1. A typical RIR consists of the
measured pressure over time at a certain place in a room. In some cases, the pressure is
expressed as energy in the air. Mathematically, the RIR represents the transfer function of the
room between the sound source and the microphone. And because a monostatic setup is used,
the RIR shows information about the location in the room. According to Cecchi et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], it is
possible to split the information provided by an RIR into three sections:
1. Direct sound: this is the sound measured by the first LOS transmission. This can be
helpful for evaluating the transmitted sound, as it is the direct, unreflected signal. In
this research, the source and receiver would be placed on the same object, a couple of
centimeters apart, which means that the direct sound is measured almost instantaneously.
In simulation it is possible to measure and transmit at the exact same position. Thus, the
direct sound is not relevant for our simulated results, as the transmitter and receiver are
positioned exactly on the same pixel.
2. Early reflections: the early reflections are the first reflected waves, created by first order
reflections. In a rectangular room, this part is likely to consist of six reflections originating
from the walls, floor and ceiling. Furthermore, these reflections are influenced by the
directivity of the used microphone and sound source. Also the distances of the walls in
relation to the object impact these early reflections. A larger distance will result in the
sound waves taking longer to reach a wall and reflect back.
3. Late reflections: the late reflections consist of all measured information after the direct sound
and early reflections. This part of the RIR contains a large amount of noise relative
to the signal, since the measured signals have dissipated over time. Nevertheless, the
late reflections contain a lot of information on the room’s spatial properties. Through
multi-path propagation, late reflections can reveal the dimensions of the room and can
aid in aural localization in the room.
      </p>
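      <p>The three sections above can be cut out of a sampled RIR with simple time thresholds; in the Python sketch below, the 2 ms and 50 ms boundaries are illustrative choices, not values taken from the paper:</p>

```python
import numpy as np

def split_rir(rir, fs, direct_ms=2.0, early_ms=50.0):
    """Split an RIR into direct sound, early reflections, and late
    reflections. The 2 ms / 50 ms boundaries are illustrative choices,
    not values taken from the paper."""
    i_direct = int(direct_ms * 1e-3 * fs)
    i_early = int(early_ms * 1e-3 * fs)
    return rir[:i_direct], rir[i_direct:i_early], rir[i_early:]

fs = 10_000                                  # 10 kHz, as in the simulations
rir = np.zeros(3000)                         # 0.3 s of samples
direct, early, late = split_rir(rir, fs)
print(len(direct), len(early), len(late))    # 20 480 2500
```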
    </sec>
    <sec id="sec-3">
      <title>3. Generating Acoustic Data</title>
      <p>
        Firstly, we employed the intuitive approach of measuring in real life. A very basic procedure,
measuring in a dorm room with a laptop (Lenovo Y520-15IKBN), was used. That laptop
simultaneously sent and received a sub-20 kHz sine-sweep signal sampled at 44100 Hz, based on
methods used by Stan et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. This method proved to be inconsistent: when conducting
multiple identical tests on different days, different measurements were found. This was thought to
be caused by noise drowning out the information gathered from the measurements. Also, obtaining
the large amount of training data that is required to train the network in this manner can be
seen as cumbersome. For this reason the research would first be validated using simulated
measurements, which can be run in parallel to limit the required time.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Simulation Methods</title>
        <p>
          Simulation is a strong, well-known tool for creating substitute data. For example, Vargas et
al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] showed that simulated data can be used to train machine learning algorithms
for sound recognition. Vargas et al. also noted that transfer learning could be used to expand
neural networks trained on simulated data to be tested on real, measured data. Such methods of
simulation are explored by Markovic et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and Deines et al. [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] more thoroughly. Multiple
ways of simulating the acoustic properties of a room exist. These can be split into two categories.
1. Solving wave equations: This approach considers numerically solving the wave equations
to find the physical properties of a room. This method is more accurate than geometrical
acoustics. The drawback of this method consists of the large computational cost of solving
the wave equations.
2. Geometrical acoustics (GA): The geometrical acoustics approach simplifies the acoustics
modeling problem by assuming sound waves to be rays. This simplification creates the
advantage of a favorably lower computational cost at the price of accuracy. Savioja et
al. [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] further expand upon these concepts in practice. Maa et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] also provide
useful insights, favouring geometrical acoustics for practitioners, while favouring wave
equations for theoretical studies.
        </p>
        <p>Both wave equation solving and GA offer useful modeling techniques for localization and
auralization purposes. The choice was made to use a wave equation solving technique due to
its accurate nature.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Finite-Difference Time-Domain Simulation</title>
        <p>We used the finite-difference time-domain method for modeling the room and, more precisely,
its reflective properties. This choice was based on multiple successful studies comparing
modeling methods and preferring FDTD over multiple GA methods like ray-tracing.</p>
        <p>
          De Sena et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and Yokota et al. [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ] used FDTD in comparative studies, where they show the relevance
of the numerical approach to localization problems in acoustics. It could be possible to execute
our research with other methods of simulation. For example, when using ray-tracing methods,
a higher frequency signal could have been simulated, as explored by Vasiou et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ].
For this research, it is more important to know what FDTD does than how the
calculations are performed. For a detailed mathematical definition of FDTD, we refer to [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] by
Schneider, where chapter twelve discusses the topic of acoustics separately. FDTD calculates the
‘next’ state of the pressure field based on all previous states, which in turn gives a result that
is close to what is expected to be measured in practice. Using this algorithm we could define a
room in MATLAB and calculate the pressure (sound) fields over a time of 0.3 seconds. This is
sufficiently long, as we can deduce that 0.3 seconds of measurements simulate paths over 100
meters long, inside a room of
10 by 10 meters. A snapshot of the FDTD simulation is depicted in Fig. 2, where the transceiver
is located at the center of the visible wavefront, and the room is shown as the borders of the
plot, with a small number of reflectors located at the left portion of the room, breaking the
symmetry of the room. The figure shows 170 by 170 nodes, as opposed to the expected 10
meters by 10 meters. This effect appears because the FDTD simulations require space to
be discretized for calculating the sound waves in the room. The coordinates in the room that
contained the location of the object were chosen to be random values in the xy-plane. Every
simulation consisted of 0.3 seconds of sampling at a random location, with the signal source at
the same location, simulating the transceiver.
        </p>
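        <p>The core of an FDTD-style simulation is a leapfrog update of the pressure field. The following minimal Python stepper on a 170 by 170 grid illustrates the principle only (the paper used MATLAB; the grid spacing, time step, and pressure-release boundaries here are assumptions):</p>

```python
import numpy as np

# Minimal 2-D wave-equation stepper in the spirit of the FDTD simulation
# described above. Illustrative only: zeroing the edges gives
# pressure-release walls; a rigid-walled room would instead need
# Neumann (zero-gradient) boundary conditions.
N = 170                          # a 10 m room discretised into 170 nodes
c = 343.0                        # speed of sound in air (m/s)
dx = 10.0 / N                    # grid spacing (m)
dt = dx / (c * np.sqrt(2))       # CFL-stable time step for 2-D
coef = (c * dt / dx) ** 2

p_prev = np.zeros((N, N))
p = np.zeros((N, N))
p[N // 2, N // 2] = 1.0          # impulsive source at the room centre

for _ in range(100):
    lap = (np.roll(p, 1, 0) + np.roll(p, -1, 0)
           + np.roll(p, 1, 1) + np.roll(p, -1, 1) - 4 * p)
    p_next = 2 * p - p_prev + coef * lap
    p_next[0, :] = p_next[-1, :] = 0.0     # pressure-release boundaries
    p_next[:, 0] = p_next[:, -1] = 0.0
    p_prev, p = p, p_next

print(p.shape)                   # (170, 170)
```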
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Using Neural Networks for Localization</title>
      <sec id="sec-4-1">
        <title>4.1. Fully Connected Neural Networks</title>
        <p>The first simulation results consisted of 3000 samples, measuring pressure at the location of the
object, which will be seen as the room impulse response (RIR). Along with the RIR, the ground
truth location was added for every simulation to later use as a label in the neural networks.
The location contains the x- and y-coordinate in the simulated room. [Figure: 200 (1x3000) RIRs
for training and 50 (1x3000) RIRs for validation feeding a fully connected neural network.] 250 simulations
were constructed based on the same room with the same additive white Gaussian noise
pulse. These simulations were the first real dataset that could be used as input for the fully
connected neural network. It was important to use a simple network at first, for clarity
reasons and to know that there is indeed the possibility to extract locations from the simulated data.</p>
        <p>The main goal in this step was not to produce a precise network, but to produce an accurate
network. This meant that the importance lay in the consistency of the localization guesses, not
in their precision. For this we designed a simple, fully connected classification
network that had a high accuracy in contrast to its small dataset of 500 simulations, as seen in
the confusion matrix in Fig. 6.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Classification Convolutional Neural Networks</title>
        <p>
          After designing the first fully connected network, it could be remarked that training a fully
connected network on the time series data would be less efficient than employing the
frequency-domain counterpart of the data. To achieve this, the fast Fourier transform (FFT) helped in
making images (spectrograms) out of the time series RIR, which was also done in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. The
dimensions of those spectrogram images change with the bin size used to store them,
where a higher bin size or resolution contains more information. The downside of using a
higher resolution is a higher computational load. In the end, the original matrix was a complex
257 by 208 matrix, which was split into two channels: the first containing
the amplitude and the second containing the phase. By using the spectrogram, the
amplitude and phase could be used as input features. A downside of using this type of network
is that the localization output is limited to a fixed set of outcomes, which drastically limits the
potential accuracy obtained by the system.
        </p>
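        <p>The two-channel amplitude/phase input can be built with a short-time Fourier transform. In this Python sketch, a 512-point window reproduces the 257 frequency rows mentioned above; the window and hop length are assumptions, so the number of time frames will not exactly match the paper's 208 columns:</p>

```python
import numpy as np
from scipy.signal import stft

fs = 10_000
rng = np.random.default_rng(0)
rir = rng.standard_normal(3000)          # stand-in for a 0.3 s recording

# A 512-point window yields 257 frequency bins, matching the 257-row
# spectrograms described above; the hann window and 50% overlap are
# scipy defaults, so the number of time frames differs from 208.
f, t, Z = stft(rir, fs=fs, nperseg=512)

# Two-channel image: amplitude and phase, as described in the text.
features = np.stack([np.abs(Z), np.angle(Z)], axis=-1)
print(features.shape[0], features.shape[-1])   # 257 2
```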
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Regression Convolutional Neural Networks</title>
        <p>The final iteration of the algorithm needed to perform monostatic localization consisted of a
convolutional network with three convolution layers, each followed by a ReLU and normalization
layer. The structure of this network can be seen in Fig. 5. The input consists of the same type
of spectrogram used in the classification type of network. The benefit of using a regression
network is that the output is not limited to a fixed set of predetermined outcomes, but returns a
set of coordinate estimations.</p>
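        <p>The way the convolution layers shrink the spectrogram can be checked with the standard output-size formula; the kernel and stride values below are illustrative, not the paper's actual hyperparameters:</p>

```python
def conv_out(n, kernel, stride=1, pad=0):
    """Spatial output size of a convolution along one axis."""
    return (n + 2 * pad - kernel) // stride + 1

# Walking a 257x208 spectrogram through three convolutions with kernel 5
# and stride 2 (illustrative values, not the paper's hyperparameters):
h, w = 257, 208
for _ in range(3):
    h, w = conv_out(h, 5, 2), conv_out(w, 5, 2)
print(h, w)   # 29 23
```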
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Overfitting</title>
        <p>
          Overfitting was a recurring obstacle during the execution of this research. It is well known
that overfitting is bound to be a problem in any research involving (convolutional) neural
networks. The frequent occurrence of overfitting means that a large number of different
methods exist to counteract it. As an example, Srivastava et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] use
dropout layers in deep neural networks. For this research, a multitude of methods were used
to counteract overfitting. Firstly, we used lower initial learning rates to stop the weights from
reaching their end values too fast; if the learning rate is too high, the network will learn too
many features from the training data and will overfit. Additionally, we used larger datasets,
containing 3000 simulations, helping restrain the overfitting. If a network uses more (diverse)
data during training, it is intuitive that the network will learn fewer trivial, wrong features. Also,
the switch from classification to regression CNNs made overfitting occur only after more epochs,
i.e. at a later stage of training, since a binary output of ‘left’ and ‘right’ makes no
difference between ‘far left’ and ‘close left’, while regression generates an exact coordinate.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Designing the Layers and Hyperparameters</title>
        <p>When designing layers for a neural network, two general starting points can be chosen. The
first option is to make a minimalist network which has as little complexity as possible, slowly
adding complexity until the desired specifications are reached. The other starting option is the
exact opposite, starting with an especially complex network and whittling down the complexity
until reaching the desired results.</p>
        <p>[Figure: network structure of Fig. 5 — an RIR spectrogram input of size (257, 208, 2), three blocks
of a convolutional 2D layer, ReLU, and batch normalization layer, followed by a fully connected layer
and a regression layer.]</p>
        <p>
          In this research, we employed a combination of these two approaches. The model on which
the first CNN was based was a classification network made for lung sound analysis, available
within our lab. This was a more complex network than what was needed for this application,
but it laid the groundwork for using spectrogram images as input data. This meant using the second
starting point: a complex neural network that can be whittled down into a usable network. We
lowered the number of weights and biases by changing the kernel and stride sizes. The art of
designing layers for neural networks relies on trial and error, as explored by Suganuma et al.
[
          <xref ref-type="bibr" rid="ref26">26</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <sec id="sec-5-1">
        <title>5.1. Fully Connected Neural Network Results</title>
        <p>After making a small dataset of 100 RIRs consisting of 3000 samples each, the first, fully connected
network, using time-domain data as input, was able to be trained and tested. The earliest results
were then reached by testing the network and plotting the confusion matrix on the limited
dataset. This is illustrated in Fig. 6.</p>
        <p>As seen in Fig. 4a, the output consists of guessing whether the object is LEFT or RIGHT in the
room, which corresponds to the left-hand side and the right-hand side of the simulated room.
The room is split into left and right by halving the x coordinate and placing the border
at that x value. Fig. 6 shows the confusion matrix of that network, with zero corresponding to
LEFT and one corresponding to RIGHT. The network is capable of estimating the rough location
of the transceiver with an accuracy of 81%.</p>
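        <p>Accuracy follows from a confusion matrix as its trace over the total count. In the sketch below, the counts are hypothetical, chosen only so the arithmetic reproduces the reported 81% figure:</p>

```python
import numpy as np

# Accuracy from a 2x2 confusion matrix (rows = true class, columns =
# predicted class; 0 = LEFT, 1 = RIGHT). The counts are hypothetical,
# chosen only so the arithmetic reproduces the reported 81% accuracy.
cm = np.array([[40, 10],
               [ 9, 41]])
accuracy = np.trace(cm) / cm.sum()
print(f"{accuracy:.0%}")   # 81%
```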
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Classification With Convolutional Neural Network</title>
        <p>The results from the first versions of the convolutional network were not optimized to a useful
degree. The added complexity and small dataset made classification harder. The used network
was too complex for the amount of data available, and overfitting was so substantial that learning
would quickly halt. This did not mean that making this network was in vain; the goal of the
classification network was to give the regression network firm fundamentals. We learned
the importance of the relation between the complexity of the network and the amount of available
data. Ways of lowering learning rates while minimizing overfitting were also learned. This
network served as a stepping stone to the next results, as it validated the possibility
of using spectrogram images of the RIR data to perform rough localization. This paved the way
for the third version of the network, which adds regression to perform more accurate estimations.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Regression With Convolutional Neural Network</title>
        <p>The benefit of using regression instead of classification is that regression trains on continuous
variables instead of specified labels. This makes it more suitable for applications such as this
one, where two parameters need to be estimated accurately. Fig. 8 shows
the distribution of estimation errors between the true locations and the location
estimates by the most accurate network we could design and train during this research. Note
that the error function is the Euclidean distance between the two points. The simulated room
wherein these predictions took place is visible in Fig. 2. The data is extracted from 300 position
estimates and shows that the majority of estimations have an estimation error below 20 cm,
with a total average of 14.7 cm.</p>
        <p>Fig. 7 depicts a single example of a set of estimated coordinates, generated using the network shown
in Fig. 5 (green marking the true location, red the prediction). The room is displayed within the
boundaries of the plot. The results are
promising and encourage future research on this topic. The height dimension could be added
in a later iteration of this research, but was not deemed relevant for the current application.
To contribute to the robustness of the research, tests were performed on the trained networks
used for the last results. The goal was to show that the localization is not random and is far
more precise and accurate than random guessing. When using the same 300 simulations used
in Fig. 8, random guesses resulted in a mean error distance of 4.0961 meters. This shows that
the results discussed, with a mean error of 0.1473 meters in the same environment, are far more
accurate.</p>
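        <p>The evaluation above can be reproduced in outline: mean Euclidean error over a batch of estimates, compared against uniform random guessing in a 10 by 10 m room (a Python sketch with synthetic positions; the exact 4.0961 m baseline depends on the paper's particular 300 simulated locations):</p>

```python
import numpy as np

def mean_euclidean_error(true_xy, pred_xy):
    """Mean Euclidean distance between true and estimated positions."""
    return np.linalg.norm(np.asarray(true_xy) - np.asarray(pred_xy), axis=1).mean()

rng = np.random.default_rng(1)
true_xy = rng.uniform(0, 10, size=(300, 2))        # 300 positions in 10x10 m
random_guess = rng.uniform(0, 10, size=(300, 2))   # uniform random guessing

# Random guessing lands near the theoretical mean distance between two
# uniform points in a 10x10 m square (about 5.2 m), far above the
# network's reported 0.147 m error.
print(mean_euclidean_error(true_xy, random_guess))
```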
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Discussion</title>
      <p>The research question that was handled in this paper was the following: “Is it possible to
localize an object inside a known room using only one sound transceiver? If it is possible, how
accurate can the measurement be without the help of this additional infrastructure and which
additional intelligent algorithms will be needed?” This paper came to the conclusion that it
is indeed possible to accurately localize an object in a room, simulated using finite-diference
time-domain numerical techniques. Diferent types of networks were tested, starting with a
classification approach where rough estimates about the position of the transceiver were made
based on time-domain recordings. To improve accuracy and extract more information out of the
time-domain data, the research switched to working with spectrogram images of the recorded
data, which made it possible to use convolutional neural networks. To reach a mean accuracy of
under 15 cm, a regression convolutional neural network was needed. This network was trained
on more than 2000 diferent spectrograms of room impulse responses.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Future Work</title>
      <p>
        This work was constrained to using only one microphone/transceiver. This made the research
interesting, differing from conventional auralization and localization research. Future research
may consist of transferring the simulated networks to real life scenarios. Bianco et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ]
suggest the use of transfer learning and summarize multiple successful studies reaching
accurate localization in real-world scenarios by employing transfer learning. Due to limitations
in time and computational power, the research in this paper was forced to use a sub-optimal
measuring sequence for this type of application. For future research, it may also be useful to have
a measuring sequence with a larger bandwidth and to explore the use of coded emissions, which
would allow multiple objects to be tracked simultaneously. Also, the influence of the transceiver
beam pattern should be examined. We hope that future research may build upon the concepts
and approaches formed in this study.
      </p>
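<p>
To make the coded-emissions idea concrete (a hedged sketch under assumed parameters, not the paper's design): two near-orthogonal chirp codes let a matched filter separate overlapping echoes from two simultaneously tracked objects in a single monostatic recording.
</p>

```python
# Hypothetical sketch of coded emissions: two (near-)orthogonal chirps let a
# matched filter separate echoes from two simultaneously tracked objects.
# All waveform parameters are assumptions, not the paper's design.
import numpy as np
from scipy import signal

fs = 192000                      # assumed sampling rate (Hz)
t = np.arange(0, 0.002, 1 / fs)  # 2 ms emission (384 samples)

up = signal.chirp(t, f0=20000, t1=t[-1], f1=80000)    # code for object A
down = signal.chirp(t, f0=80000, t1=t[-1], f1=20000)  # code for object B

# Simulated monostatic recording: both echoes overlap at one microphone.
rec = np.zeros(2000)
d_a, d_b = 400, 900              # true echo delays in samples
rec[d_a:d_a + up.size] += 0.8 * up
rec[d_b:d_b + down.size] += 0.5 * down

# Matched filtering (cross-correlation) per code recovers each delay,
# because each code correlates strongly only with its own echo.
estimates = {}
for name, code in [("A", up), ("B", down)]:
    corr = signal.correlate(rec, code, mode="valid")
    estimates[name] = int(np.argmax(np.abs(corr)))

print(estimates)
```

The design choice here is the large time-bandwidth product of the chirps: it keeps the cross-correlation between the two codes small relative to each autocorrelation peak, which is what makes simultaneous tracking feasible.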
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] B. Siciliano, O. Khatib, Springer Handbook of Robotics, Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. URL: http://link.springer.com/10.1007/978-3-540-30301-5. doi:10.1007/978-3-540-30301-5.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] G. Seco-Granados, J. López-Salcedo, D. Jiménez-Baños, G. López-Risueño, Challenges in indoor global navigation satellite systems: Unveiling its core features in signal processing, IEEE Signal Processing Magazine 29 (2012) 108-131. doi:10.1109/MSP.2011.943410.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] J. Urena, A. Hernandez, J. J. García, J. M. Villadangos, M. C. Perez, D. Gualda, F. J. Álvarez, T. Aguilera, Acoustic local positioning with encoded emission beacons, Proceedings of the IEEE 106 (2018) 1042-1062.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] I. Dokmanić, R. Parhizkar, A. Walther, Y. M. Lu, M. Vetterli, Acoustic echoes reveal room shape, Proceedings of the National Academy of Sciences 110 (2013) 12186-12191.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. Khattak, C. Papachristos, K. Alexis, Visual-thermal landmarks and inertial fusion for navigation in degraded visual environments, CoRR abs/1903.01656 (2019). URL: http://arxiv.org/abs/1903.01656. arXiv:1903.01656.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] T. J. Cox, P. D'Antonio, Acoustic Absorbers and Diffusers, volume 4, 2009. doi:10.4324/9781482266412.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] N. Le Dortz, F. Gain, P. Zetterberg, WiFi fingerprint indoor positioning system using probability distribution comparison, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings (2012) 2301-2304. doi:10.1109/ICASSP.2012.6288374.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] Y. Nagama, T. Umezawa, N. Osawa, Indoor localization based on analysis of environmental ultrasound, in: IPIN (Short Papers/Work-in-Progress Papers), 2019, pp. 423-430.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] J. M. Vera-Diaz, D. Pizarro, J. Macias-Guarasa, Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates, Sensors (Switzerland) 18 (2018). URL: https://pubmed.ncbi.nlm.nih.gov/30322007/. doi:10.3390/s18103418. arXiv:1807.11094.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] P. Stoica, H. He, J. Li, Optimization of the receive filter and transmit sequence for active sensing, IEEE Transactions on Signal Processing 60 (2012) 1730-1740. doi:10.1109/TSP.2011.2179652.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] H. He, J. Li, P. Stoica, Waveform design for active sensing systems: a computational approach, Cambridge University Press, 2012.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] N. Antonello, E. De Sena, M. Moonen, P. A. Naylor, T. van Waterschoot, Room impulse response interpolation using a sparse spatio-temporal representation of the sound field, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 1929-1941. doi:10.1109/TASLP.2017.2730284.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] D. El Badawy, I. Dokmanić, Direction of arrival with one microphone, a few legos, and nonnegative matrix factorization, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (2018) 2436-2446. doi:10.1109/TASLP.2018.2867081.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] S. Cecchi, A. Carini, S. Spors, Room response equalization-a review, Applied Sciences 8 (2017) 16. URL: https://doi.org/10.3390/app8010016. doi:10.3390/app8010016.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] G. A. D. Stan, J. Embrechts, Comparison of different impulse response measurement techniques, AES: Journal of the Audio Engineering Society 50 (2002) 249-262.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] E. Vargas, J. R. Hopgood, K. Brown, K. Subr, On improved training of CNN for acoustic source localisation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 720-732. doi:10.1109/TASLP.2021.3049337.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] M. Markovic, S. K. Olesen, D. Hammershøi, Three-dimensional point-cloud room model in room acoustics simulations 133 (2013) 3532-3532. URL: https://sfx.aub.aau.dk/sfxaub?sid=pureportal&amp;doi=10.1121/1.4806371. doi:10.1121/1.4806371.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] E. Deines, M. Hering-Bertram, J. Mohring, J. Jegorovs, F. Oberste-Dommes, G. Nielson, Comparative visualization for wave-based and geometric acoustics, IEEE Transactions on Visualization and Computer Graphics 12 (2006) 1173-1180.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] L. Savioja, U. P. Svensson, Overview of geometrical room acoustic modeling techniques, The Journal of the Acoustical Society of America 138 (2015) 708-730. URL: https://doi.org/10.1121/1.4926438. doi:10.1121/1.4926438.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] D. Y. Maa, The flutter echoes, The Journal of the Acoustical Society of America 13 (1941) 170-178. URL: https://doi.org/10.1121/1.1916161. doi:10.1121/1.1916161.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] E. De Sena, N. Antonello, M. Moonen, T. van Waterschoot, On the modeling of rectangular geometries in room acoustic simulations, IEEE/ACM Transactions on Audio, Speech, and Language Processing 23 (2015) 774-786. doi:10.1109/TASLP.2015.2405476.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] T. H. Yokota, S. Sakamoto, Comparison of room impulse response calculated by the simulation methods based on geometrical acoustics and wave acoustics, Institute of Industrial Science, University of Tokyo (2002) 2715-2716.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] E. Vasiou, K. Shkurko, I. Mallett, E. Brunvand, C. Yuksel, A detailed study of ray tracing performance: render time and energy cost, The Visual Computer 34 (2018). doi:10.1007/s00371-018-1532-8.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] J. B. Schneider, Understanding the Finite-Difference Time-Domain Method, Washington, 2020.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research 15 (2014) 1929-1958. URL: http://jmlr.org/papers/v15/srivastava14a.html.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26] M. Suganuma, S. Shirakawa, T. Nagao, A genetic programming approach to designing convolutional neural network architectures, in: Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 497-504. URL: https://doi.org/10.1145/3071178.3071229. doi:10.1145/3071178.3071229.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27] M. J. Bianco, P. Gerstoft, J. Traer, E. Ozanich, M. A. Roch, S. Gannot, C.-A. Deledalle, Machine learning in acoustics: Theory and applications, The Journal of the Acoustical Society of America 146 (2019) 3590-3628. URL: https://doi.org/10.1121/1.5133944. doi:10.1121/1.5133944.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>