Identification of NLOS Acoustic Signal Using CNN
and Bi-LSTM
Hucheng Wang1,2 , Suo Qiu2 , Haoran Shu2 , Lisa (Jingjing) Wang3 , Xiaonan Luo1 ,
Zhi Wang1,* and Lei Zhang4
1 Guilin University of Electronic Technology, Guilin, 541004, China
2 Zhejiang University, Hangzhou, 310027, China
3 Research China, Signify Holding, Shanghai, 200233, China
4 Chang'an University, Xi'an, 710061, China


Abstract
Compared with other indoor positioning techniques, the acoustic signal is an ideal medium for indoor positioning systems due to its high compatibility and low deployment cost. The main cause of performance degradation during acoustic signal propagation is non-line-of-sight (NLOS) conditions. The traditional signal filtering process is tedious and time-consuming. At the same time, deep learning has shown excellent performance in acoustic signal processing and classification tasks. In this letter, an acoustic signal line-of-sight (LOS)/NLOS identification method based on convolutional neural network (CNN) and bi-directional long short-term memory (Bi-LSTM) models is proposed. Instead of the spectrogram, an acoustic signal spectrum matrix was fed into the network. The CNN was employed to automatically extract features from the two-dimensional image-like spectrum matrix, and Bi-LSTM was utilized for classification. We evaluated the classification accuracy of CNN and Bi-LSTM variants with different architectures and found that the best one achieved 97.34% classification accuracy.

Keywords
Acoustic, NLOS, CNN, Bi-LSTM




1. Introduction
In indoor Internet of Things (IoT) technology, the user's location is crucial private data for
personalized services [1]. Owing to the inability of GNSS signals to penetrate walls and to urban
shielding effects, indoor positioning often requires additional signal media. An acoustic signal
has a naturally low synchronization cost compared with other indoor positioning techniques.
Electromagnetic positioning methods that query a fingerprint database have accuracy limited
by the grid size. In contrast, Time of Arrival (ToA)/Time Difference of Arrival (TDoA)-based
acoustic positioning usually achieves a positioning accuracy of decimeters to
centimeters [2]. More significantly, the acoustic signal is fully compatible with


IPIN 2022 WiP Proceedings, September 5–7, 2022, Beijing, China
* Corresponding author.
Email: hcwang0717@gmail.com (H. Wang); 12032102@zju.edu.cn (S. Qiu); lisa.wang@signify.com (L. J. Wang);
luoxn@guet.edu.cn (X. Luo); zjuwangzhi@zju.edu.cn (Z. Wang); zhlei0202@163.com (L. Zhang)
ORCID: 0000-0002-9744-389X (H. Wang); 0000-0002-0751-5045 (X. Luo); 0000-0002-0490-2031 (Z. Wang);
0000-0001-5879-514X (L. Zhang)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
smart terminals, such as the smartphones currently on the market. Users do not need to install
additional hardware, which makes the approach easier to promote and disseminate.
   Acoustic signals usually have a frequency of approximately 20 Hz–20 kHz and a wavelength
of approximately 17 mm–17 m. Attenuation is evident once the signal is blocked by obstacles.
The accuracy of raw acoustic positioning is usually not ideal in the absence of any correction.
When the transmitting source, such as a speaker, and the receiving source, such as a smartphone
microphone, are not directly reachable, the acoustic signal may reach the receiver only after
multiple reflections off the surrounding walls, which causes signal delay and may also lead to
signal loss and degradation.
   In acoustic indoor positioning systems (IPSs), NLOS propagation is one of the most formidable
factors that decrease positioning accuracy. NLOS communication is a situation in which wireless
signals cannot reach the receiver directly due to the presence of obstacles. Detecting, filtering,
or correcting NLOS signals has therefore become a crucial part of IPSs, and the accuracy of the
detection algorithm directly affects every subsequent stage. In the literature, the following
identification schemes are common.
   ∙ Most methods record the ranging values of previous measurements and compare them
     with the current one. The record tracks the moving or static state of the measurement
     target and refers to the user's speed or step length, the moving direction, and so on.
     These data can be obtained from the signal itself [3] or through external sensors, for
     example by installing an inertial measurement unit to obtain inertial data [4, 5], and
     NLOS or missing data are corrected according to a coarse-grained coupling algorithm [6].
     This approach is not suitable for highly maneuverable targets.
   ∙ Another research hotspot is signal feature extraction, such as Channel State Information
     (CSI) [7], propagation delay [8], channel quality [9], energy intensity [10, 11], and statistical
     data used as machine-learning training samples [12, 13]. The signal propagation distance
     is estimated through large-scale data analysis: computing the generalized cross-correlation
     (GCC), finding the correlation peak, and calculating the time-delay interval of the direct
     signal from the messy raw signal. Support Vector Machines (SVMs), Variational
     Autoencoders (VAEs), and decision trees are often employed. The authors of [14] collected the
     time-delay characteristics, waveform characteristics, Rician K-factors, and frequency
     characteristics of relative channel gain and combined them in a Radial Basis Function (RBF)
     kernel. The authors of [15] proposed a structured Bi-LSTM for three-dimensional (3D)
     terahertz signals. The authors of [11] improved on [10] and obtained better results after denoising.
   ∙ The building structure of the room and the indoor map can also be used to distinguish
     NLOS information [16].
   An acoustic signal has strong temporal correlation, and the acoustic spectrum carries rich
characteristic information, which motivates the use of deep learning to distinguish NLOS data.
In this letter, we propose a novel acoustic NLOS signal recognition method that combines a
convolutional neural network (CNN) and Bi-LSTM. The CNN extracts the spectral features of
the acoustic signal, and Bi-LSTM exploits the strong temporal correlation to classify NLOS.
Figure 1 shows the main structure.
   The remaining parts of this letter are organized as follows: Section II describes the acoustic
signal and its spectral characteristics; Section III shows how to choose and use the CNN and
Bi-LSTM for classification; the training results, data analysis, and planned future work are presented
in Section IV; and Section V concludes the paper.

[Figure 1 placeholder: audio captured from the microphone → input spectrum matrix → CNN without FC layer → feature map (512 channels) → Bi-LSTM → FC layer → classification result (LOS / NLOS / others).]

Figure 1: Main structure of proposed method.


2. Characterization of acoustic signal
The acoustic signal in this work is an autonomously modulated chirp signal from [17]. A single chirp
signal can be modulated as

$$s(t) = \exp\!\big(j2\pi(f_0 t + \tfrac{1}{2}u_0 t^2)\big), \tag{1}$$

where $f_0$ and $u_0$ are the initial frequency and modulation rate, respectively.
   To facilitate the analysis of the waveform, we inserted a silent interval instead of Frequency
Modulated Continuous Wave (FMCW) to form a complete period $T$. Then, the transmitted
signal can be described as

$$t(\tau) = \sum_{i=0}^{\infty} \varepsilon(t - \tau + iT)\, s(\tau - iT), \tag{2}$$

where $\varepsilon(\cdot)$ is a step function and $i$ denotes the $i$-th chirp.
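As a concrete illustration of Eqs. (1) and (2), the following minimal Python sketch generates the real part of one chirp followed by a silent interval and repeats it with period T. The band (18–20 kHz), chirp length, and period are illustrative assumptions consistent with the signal described in this section, not the authors' exact modulation parameters.

```python
# Sketch of the chirp-plus-silence transmit signal of Eqs. (1)-(2).
import numpy as np

fs = 48_000              # smartphone sampling rate (Hz)
f0 = 18_000              # initial frequency f0 (Hz), assumed
chirp_len = 0.05         # chirp duration (s), assumed
period = 0.2             # complete period T including silence (s), assumed
u0 = 2_000 / chirp_len   # modulation rate u0 so the sweep spans 2 kHz

t = np.arange(int(fs * chirp_len)) / fs
chirp = np.cos(2 * np.pi * (f0 * t + 0.5 * u0 * t**2))  # real part of Eq. (1)

# Eq. (2): chirp followed by silence, tiled over consecutive periods.
one_period = np.concatenate([chirp, np.zeros(int(fs * (period - chirp_len)))])
tx = np.tile(one_period, 10)
```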

[Figure 2 placeholder: time–frequency diagram of a chirp s(t) of duration t followed by a silent interval, forming a complete period T that repeats.]
Figure 2: Time-frequency diagram of chirp signal and silent interval.


   Considering a complex indoor environment, there are $N_R$ reflected rays and $N_D$ diffracted
rays received by a microphone, and then the channel response of a microphone can be described
as

$$r(\tau) = \alpha_L N_L\, t(\tau - \tau_L) w(\tau) + \sum_{m=1}^{N_R} \alpha_R^m\, t(\tau - \tau_R^m) w(\tau) + \sum_{n=1}^{N_D} \alpha_D^n\, t(\tau - \tau_D^n) w(\tau) + n(\tau), \tag{3}$$

where the subscripts $L$, $R$, and $D$ denote the parameters related to the LOS ray, reflection rays,
and diffusion rays, respectively, as shown in Figure 3, and $\alpha$ denotes the attenuation of the
different rays. The Blackman window $w(\cdot)$ is employed to erase slight multipath fluctuations. The
residual noise, such as electromagnetic vibration noise, is represented by $n(\cdot)$. $\tau_L$, $\tau_R^m$, and $\tau_D^n$
refer to the propagation delays of the different paths and are calculated by $\tau_{(\cdot)}^{(\cdot\cdot)} = d_{(\cdot)}^{(\cdot\cdot)}/c$, where
the superscript indexes the $m$-th or $n$-th path, the subscript denotes the type of arrival path, and
$c$ is the velocity of sound.

[Figure 3 placeholder: a speaker and microphones separated by a wall and a shelter or screen, illustrating the LOS path and the NLOS (reflection and diffusion) paths.]

Figure 3: Schematic of LOS and NLOS (reflection and diffusion).


   $N_L \in \{0, 1\}$ indicates whether the LOS signal reaches the microphone. Generally, NLOS is
composed of reflection and diffusion rays. The time-domain diagrams of the signal are shown in
Figure 4. We intercepted a 1-second signal for comparison. Note that, in magnitude, the NLOS
signal is weaker than one-fifth of the LOS signal.
   To avoid interference from human activity noise, we modulate the positioning signal into the
audio band above 18 kHz, a band that the 48 kHz sampling rate of commercial mobile phones on
the market can accommodate.
   Spectrogram: an image directly output by acoustic software or a plotting function, whose
dimensions correspond to time and frequency and whose values are filled with the Power Spectral
Density (PSD), as shown in Figure 5.
   Spectrum Matrix: the new concept we propose has the same dimensions as the spectrogram,
but its values are constituted by the short-time Fourier transform (STFT). Since this kind of
information is presented in a matrix, we named it the spectrum matrix. In this letter, it is stored as
a regularized grayscale image imitating the spectrogram, as shown in Figure 6.
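As a sketch of how such a spectrum matrix can be produced from a 1 s audio slice, the following Python snippet computes the STFT magnitude and normalizes it to the 0–255 grayscale range used for storage; the window and hop sizes are our own assumptions, as the paper does not specify them.

```python
# Sketch: 1 s audio slice -> STFT magnitude -> 0-255 grayscale spectrum matrix.
import numpy as np
from scipy.signal import stft

def spectrum_matrix(audio: np.ndarray, fs: int = 48_000) -> np.ndarray:
    _, _, Zxx = stft(audio, fs=fs, nperseg=512, noverlap=256)  # assumed sizes
    mag = np.abs(Zxx)
    mag = 255 * (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)
    return mag.astype(np.uint8)  # stored on disk as a grayscale image
```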
[Figure 4 placeholder: 1-second time-domain waveforms; (a) LOS with amplitude roughly within ±10000; (b) NLOS with amplitude roughly within ±2000.]
Figure 4: Audio captured from microphone in LOS and NLOS conditions (at a distance of 20 m).


  The separation of coupled signals has always been a complicated issue. Typical filters cannot
separate the aliased signal with an accuracy above 90%, whereas neural networks offer new
possibilities.


3. Proposed neural network method




[Figure 5 placeholder: (a) LOS spectrogram; (b) NLOS spectrogram.]

Figure 5: Spectrogram comparison between LOS and NLOS.


  It can be seen that the 2D image-like spectrograms in Figure 5 have prominent block-area
characteristics. However, not every pixel and every color in the spectrogram is significant.
Training on the spectrogram images produces many redundant, useless features and decreases
the training accuracy. We therefore further purified the spectral information with the STFT to
obtain the spectrum matrix shown in Figure 6. This refines the training data and incidentally
filters out part of the background noise.
[Figure 6 placeholder: spectrum matrices for (a) LOS and (b) NLOS.]
  Figure 6: Spectrum matrix comparison between LOS and NLOS.



[Figure 7 placeholder: forward and backward rows of LSTM units L(i−1), L(i), L(i+1), each fed by a CNN block with 512 channels.]
  Figure 7: Basic Bi-LSTM framework.


  3.1. Feature extraction of spectrum matrix based on CNN
A CNN is a multi-layer deep-learning neural network comprising multiple convolutional
layers and multiple pooling layers. A CNN uses gradient descent on the loss function to adjust
the weight parameters layer by layer in the backward pass, and improves network accuracy
through repeated iterative training.
   The convolutional layers are designed to extract features of the spectrum matrix sequence.
Multiple kernels are employed in the convolutional layers. The Rectified Linear Unit (ReLU)
activation function converts the spectrum matrix into feature characteristics, and the MaxPooling
layers down-sample these characteristics.
   The fully connected (FC) layer is usually the last layer of a CNN, together with the SoftMax
function. However, the traditional FC layer in a CNN ignores contextual relevance, and NLOS
rarely appears in isolation but rather continuously. Based on this characteristic, in the present
work we removed the FC layer of the CNN and replaced it with Bi-LSTM for the NLOS
classification task.
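A minimal PyTorch sketch of such an FC-free feature extractor is shown below: a 7 × 7 stem convolution, a single max-pooling stage, convolution/batch-normalization/ReLU stages widening through 64–512 channels, and adaptive average pooling to a 512 × 1 × 1 block. The layer counts and strides here are assumptions; Section IV also reports ResNet, VGG, and MobileNet backbones in this role.

```python
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout, stride=2):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class SpectrumCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 64, 7, stride=2, padding=3, bias=False),  # 7x7 kernel
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),  # the only max-pooling layer
        )
        self.stages = nn.Sequential(          # 64 -> 128 -> 256 -> 512 channels
            conv_bn_relu(64, 128), conv_bn_relu(128, 256), conv_bn_relu(256, 512),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # 512 x 1 x 1 feature block

    def forward(self, x):                     # x: (batch, 1, 256, 256)
        return self.pool(self.stages(self.stem(x))).flatten(1)  # (batch, 512)
```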

  3.2. NLOS classification based on Bi-LSTM
In the actual environment, occlusion events are not independent. For a long-term sound-wave
sequence, LSTM is one of the best solutions. In fact, we found that both forward and backward
information can help determine occlusion, so Bi-LSTM is adopted herein.
   The bi-directional RNN can use information from both before and after the current moment,
which further improves the accuracy of the judgment. A traditional LSTM passes only the first
$(i-1)$ states' information to the $i$-th moment. If the first $(i-1)$ states are all LOS and the user
then switches from LOS to NLOS, the previous information is not sufficient for a correct
judgment. By introducing the backward LSTM layer, the $(i+1)$-th and later states can amend the
current judgment. The output of Bi-LSTM can be described as

$$\mathbf{O}_i = g\big(\mathbf{W}_f^i o_f^i + \mathbf{W}_b^i o_b^i\big), \tag{4}$$

where $\mathbf{O}_i$ is the output vector of the Bi-LSTM; $o_f^i$ represents the $i$-th node of the forward LSTM,
while $o_b^i$ is the $i$-th node of the backward LSTM; $g$ denotes the ReLU activation function; and
$\mathbf{W}_f^i$ and $\mathbf{W}_b^i$ are trainable matrices. $o_f^i$ and $o_b^i$ can be obtained from the single LSTM unit.
$\mathbf{O} = \{\mathbf{O}_1, \mathbf{O}_2, \ldots\}$ is then classified by the FC and SoftMax layers to identify the LOS and NLOS data.
   The propagation of the CNN block $x^i$ through the forget gate can be expressed as

$$f^i = \sigma_g(\omega_f x^i + u_f h^{i-1} + b_f), \tag{5}$$

where $f^i \in \mathbb{R}^h$ is the output vector of the forget gate, $\omega_f \in \mathbb{R}^{h\times h}$ and $u_f \in \mathbb{R}^{h\times d}$ are the
updating weight matrices, and $b_f \in \mathbb{R}^h$ denotes the bias vector. The activation function $\sigma_g$ is
the Sigmoid function, so the forget gate yields values of $f^i$ in $[0, 1]$, which decide the degree to
which the memory cell is kept in the next operation.
   The input gate regulates the input data $x^i$ and the processed state vector $f^i$ from the forget
gate, which is described as

$$i^i = \sigma_g(\omega_i x^i + u_i h^{i-1} + b_i),$$
$$\tilde{C}^i = \sigma_h(\omega_C x^i + u_C h^{i-1} + b_C), \tag{6}$$
$$C^i = f^i C^{i-1} + i^i \tilde{C}^i,$$

where $\omega_i, \omega_C \in \mathbb{R}^{h\times h}$ and $u_i, u_C \in \mathbb{R}^{h\times d}$ denote the updating weight matrices that iterate
through training in the input gate, and $b_C, b_i \in \mathbb{R}^h$ denote the biases. The activation function $\sigma_h$
is the $\tanh(\cdot)$ function. Once the candidate cell state vector $\tilde{C}^i$ is computed, the real
cell state vector $C^i$ can be updated from the last cell state $C^{i-1}$.
   The third gate in a single LSTM unit is the output gate. The output $o^i$ is based on the above
cell state, but it is a filtered version. The output gate is written as

$$o^i = \sigma_g(\omega_o x^i + u_o h^{i-1} + b_o),$$
$$h^i = o^i \cdot \sigma_h(C^i), \tag{7}$$

where $\omega_o \in \mathbb{R}^{h\times h}$, $u_o \in \mathbb{R}^{h\times d}$, and $b_o \in \mathbb{R}^h$ again represent the weights and bias. The
hidden state $h^i$ is updated in this gate from the output $o^i$ and the new cell state $C^i$ through the
activation function $\sigma_h$. Then, we insert $o^i$ into Eq. 4 and obtain the total output $\mathbf{O}_i$ of the
Bi-LSTM.
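The following PyTorch sketch shows how such a classification head could look: a sequence of 32 CNN feature blocks is read by a two-layer bidirectional LSTM and classified per step by an FC layer. The sizes follow Section IV (32 states, hidden size 256), but note that nn.LSTM concatenates the forward and backward outputs, so here the FC layer plays the role of the trainable combination $\mathbf{W}_f$, $\mathbf{W}_b$ of Eq. (4); this is an illustrative assumption, not the authors' exact head.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=2):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, num_layers=2,   # "double" Bi-LSTM
                           batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)  # forward + backward states

    def forward(self, feats):          # feats: (batch, 32, 512) CNN blocks
        out, _ = self.rnn(feats)       # (batch, 32, 2 * hidden)
        return self.fc(out)            # per-step logits; SoftMax applied in the loss

# Example: logits = BiLSTMClassifier()(torch.randn(4, 32, 512))
```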
4. Dataset and experimental results
To assess the proposed method, we designed an experiment and collected audio data. The
LOS and NLOS data were collected with four microphones in different locations in four
different indoor rooms: Laboratory 1, Laboratory 2, Office 1, and Office 2. For each microphone,
more than 400 pieces of LOS data and 400 pieces of NLOS data were collected. Different rooms
were selected to broaden the room impulse response (RIR) information and prevent over-fitting.
Microphones in different locations ensure that the training generalizes to every part of the
entire room. Each piece of data was cleaned, sliced to a length of 1 s, and labeled, yielding a total
of 12,800 raw audio samples. We shuffled all the data for each epoch and selected 8,960 samples
(70% of 12,800) as a training set; the remaining samples formed the testing set. All sampling
processes were entirely random to prevent over-fitting the model. On these data, model training
takes 1 hour and 22 minutes, and classifying a single sample from the test dataset takes 0.98
seconds. The dataset is available on IEEE DataPort; more detailed descriptions of the experiments
and collections can be obtained by contacting the authors.
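A minimal sketch of the 70/30 split and per-epoch shuffling described above, with placeholder tensors standing in for the real spectrum matrices and labels:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

samples = torch.rand(12_800, 1, 256, 256)   # placeholder spectrum matrices
labels = torch.randint(0, 2, (12_800,))     # 0 = LOS, 1 = NLOS (placeholder)
dataset = TensorDataset(samples, labels)

train_set, test_set = random_split(dataset, [8_960, 3_840])         # 70% / 30%
train_loader = DataLoader(train_set, batch_size=256, shuffle=True)  # reshuffled each epoch
```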

4.1. From raw audio wave to spectrum matrix
Generally, audio data are analyzed with the help of PSD spectrograms. We trained on the
spectrogram data and obtained the poor result shown at the top of Figure 8. We assume that this
failure occurred for two reasons. First, most of the content captured in the spectrogram is useless
information, the majority of it background noise such as human voices and mechanical vibrations.
Second, acoustic positioning systems are usually arranged to operate at ranges of up to several
dozen meters, so the target signal captured by the microphone may be weaker than the other
background noise, and the testing loss barely decreases. Changing the input data at the
signal-processing level can significantly improve the classification performance: the bottom
panel of Figure 8 shows that the spectrum matrix yields an apparent gain in the classification
training result.

4.2. Network design and configuration
For the CNN segment, which comes first in the sequential pipeline, we set the kernel size to 7 × 7
in the first convolution layer to quickly reduce the input dimension. We then added batch
normalization and ReLU layers to accelerate training and prevent the internal covariate shift
phenomenon. A single max-pooling layer was used for down-sampling and highlighting essential
information. Next, we repeated the convolution, batch normalization, and ReLU layers to widen
the feature maps through 64, 128, 256, and 512 channels. Finally, in the CNN, adaptive average
pooling was configured to extract a feature block of 512 × 1 × 1 and pass it to the Bi-LSTM network.
   The design of the Bi-LSTM is detailed in Section III.B. The Bi-LSTM compressed the CNN
blocks into 32 states, which were classified by the FC layer. Interestingly, we found that a double
Bi-LSTM gives better training results than a single Bi-LSTM or a double LSTM. Therefore, we
adopted a double Bi-LSTM on top of the CNN as the proposed model.
   Regarding the hyper-parameters, the batch size was set to 256 and the hidden layer size of the
LSTM to 256. We used stochastic gradient descent to optimize the network weights, with an
initial learning rate of 0.01, momentum of 0.9, and weight decay of 3e-4.
[Figure 8 placeholder: training and testing accuracy versus number of epochs for (a) the spectrogram input and (b) the spectrum matrix input.]
Figure 8: Results of spectrogram and spectrum matrix.

[Figure 9 placeholder: accuracy (left axis) and testing loss (right axis) versus number of epochs for single Bi-LSTM, double Bi-LSTM, and double LSTM structures.]
Figure 9: Accuracy comparison of different LSTM structures.
Table 1
Comparison with different methods and models.

    Method (with model)                  Accuracy (%)    Epochs to stop
    CNN with Bi-LSTM (spectrogram)           45.45               1
    CNN with single Bi-LSTM                  96.78              96
    CNN with double LSTM                     95.83              91
    CNN with double Bi-LSTM (proposed):
        ResNet 18                            97.34              70
        ResNet 34                            96.40              78
        ResNet 50                            94.31              80
        VGG 19                               96.02              46
        MobileNet                            97.15              88

   The values of the spectrum matrix were normalized to 0–255 so that the images could be stored
on disk as grayscale images. Moreover, as the sizes of the images differed, they were padded to
256 × 256 before being fed to the network model.
   All the experiments and models were implemented in PyTorch on two NVIDIA RTX 3090
graphics processing units. It is worth mentioning that, to accelerate the training process, we
used Apex [18] for mixed-precision and distributed training.
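A minimal sketch of the reported optimizer settings is given below; the cross-entropy loss is our assumption (the paper does not name its loss function), and SpectrumCNN/BiLSTMClassifier refer to the illustrative sketches from Section III.

```python
import torch.nn as nn
import torch.optim as optim

cnn, head = SpectrumCNN(), BiLSTMClassifier()
optimizer = optim.SGD(list(cnn.parameters()) + list(head.parameters()),
                      lr=0.01, momentum=0.9, weight_decay=3e-4)
criterion = nn.CrossEntropyLoss()  # assumed loss function
```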

4.3. Discussion
As can be seen from Figure 9, the testing accuracy experienced massive jitters before stabilizing.
These sharp deteriorations did not appear in the training set, nor did training fail to converge or
suffer exploding gradients; instead, the accuracy returned to normal spontaneously, and this
process repeated continuously. As the number of epochs increases, the testing loss and learning
rate gradually stabilize, the jitters disappear, and the testing accuracy becomes stable. We suspect
that this may be due to the low effective signal-to-noise ratio (SNR): since the experimental scene
is close to genuine circumstances, we did not remove the sound of wind and people talking, but
deliberately kept it mixed in as noise, raising the difficulty of identification.
   Fortunately, the proposed model is robust enough to eliminate this phenomenon within
about 80 epochs (as few as 46 epochs with the VGG 19 model) and to guarantee an accuracy of
approximately 94% (up to 97.34% with the ResNet 18 model).
   Although the Doppler effect is widely used in vehicular FMCW radar, a static target produces
no Doppler effect. The method proposed in this paper does not rely on the Doppler effect and
can effectively determine NLOS conditions for both stationary and dynamic targets.


5. Conclusions
In this letter, a novel method for identifying LOS/NLOS acoustic signals is proposed. We use a
CNN and Bi-LSTM in a deep neural network, together with an adapted training model, to raise
the signal recognition accuracy to up to 97.34%. As the input of the neural network, instead of
the raw audio signal or the spectrogram, a 2D image-like spectrum matrix is proposed to obtain
precise classification. To the best of our knowledge, in the field of acoustics in IPSs, this letter
reports the first use of a neural network focused on classifying NLOS signals.


Acknowledgements
This work was supported in part by the Fundamental Research Funds for the Central Universities
(Zhejiang University NGICS Platform) and by the National Natural Science Foundation of China
under Grant Nos. 61773344, 61273079, 61772149, 61936002, and 6202780103.


References
 [1] X. Guo, N. Ansari, F. Hu, Y. Shao, N. R. Elikplim, L. Li, A survey on fusion-based indoor
     positioning, IEEE Communications Surveys & Tutorials 22 (2019) 566–594.
 [2] S. Cao, X. Chen, X. Zhang, X. Chen, Combined weighted method for TDOA-based localization,
     IEEE Transactions on Instrumentation and Measurement 69 (2019) 1962–1971.
 [3] S. Zhang, C. Yang, D. Jiang, X. Kui, S. Guo, A. Y. Zomaya, J. Wang, Nothing blocks me:
     Precise and real-time LOS/NLOS path recognition in RFID systems, IEEE Internet of Things
     Journal 6 (2019) 5814–5824.
 [4] H. Wang, L. Zhang, Z. Wang, X. Luo, PALS: High-accuracy pedestrian localization with
     fusion of smartphone acoustics and PDR, in: IPIN (Short Papers/Work-in-Progress Papers),
     2019, pp. 291–298.
 [5] Q. Xu, R. Zheng, S. Hranilovic, IDyLL: Indoor localization using inertial and light sensors
     on smartphones, in: Proceedings of the 2015 ACM International Joint Conference on
     Pervasive and Ubiquitous Computing, 2015, pp. 307–318.
 [6] S. Gao, F. Zhang, G. Wang, NLOS error mitigation for TOA-based source localization with
     unknown transmission time, IEEE Sensors Journal 17 (2017) 3605–3606.
 [7] J.-S. Choi, W.-H. Lee, J.-H. Lee, J.-H. Lee, S.-C. Kim, Deep learning based NLOS identification
     with commodity WLAN devices, IEEE Transactions on Vehicular Technology 67 (2017)
     3295–3303.
 [8] F. Xiao, Z. Guo, H. Zhu, X. Xie, R. Wang, AmpN: Real-time LOS/NLOS identification with WiFi,
     in: 2017 IEEE International Conference on Communications (ICC), IEEE, 2017, pp. 1–7.
 [9] K. Wen, K. Yu, Y. Li, NLOS identification and compensation for UWB ranging based on
     obstruction classification, in: 2017 25th European Signal Processing Conference (EUSIPCO),
     IEEE, 2017, pp. 2704–2708.
[10] C. Jiang, J. Shen, S. Chen, Y. Chen, D. Liu, Y. Bo, UWB NLOS/LOS classification using deep
     learning method, IEEE Communications Letters 24 (2020) 2226–2230.
[11] C. Jiang, S. Chen, Y. Chen, D. Liu, Y. Bo, A UWB channel impulse response de-noising
     method for NLOS/LOS classification boosting, IEEE Communications Letters 24 (2020) 2513–
     2517.
[12] V.-H. Nguyen, M.-T. Nguyen, J. Choi, Y.-H. Kim, NLOS identification in WLANs using deep
     LSTM with CNN features, Sensors 18 (2018) 4057.
[13] V. Barral, C. J. Escudero, J. A. García-Naya, R. Maneiro-Catoira, NLOS identification and
     mitigation using low-cost UWB devices, Sensors 19 (2019) 3464.
[14] L. Zhang, D. Huang, X. Wang, C. Schindelhauer, Z. Wang, Acoustic NLOS identification
     using acoustic channel characteristics for smartphone indoor localization, Sensors 17
     (2017) 727.
[15] S. Fan, Y. Wu, C. Han, X. Wang, A structured bidirectional LSTM deep learning method for
     3D terahertz indoor localization, in: IEEE INFOCOM 2020 - IEEE Conference on Computer
     Communications, IEEE, 2020, pp. 2381–2390.
[16] N. Rajagopal, P. Lazik, N. Pereira, S. Chayapathy, B. Sinopoli, A. Rowe, Enhancing indoor
     smartphone location acquisition using floor plans, in: 2018 17th ACM/IEEE International
     Conference on Information Processing in Sensor Networks (IPSN), IEEE, 2018, pp. 278–289.
[17] L. Zhang, M. Chen, X. Wang, Z. Wang, TOA estimation of chirp signal in dense multipath
     environment for low-cost acoustic ranging, IEEE Transactions on Instrumentation and
     Measurement 68 (2018) 355–367.
[18] NVIDIA, Apex: Tools for easy mixed-precision training in PyTorch,
     https://developer.nvidia.com/blog/apex-pytorch-easy-mixed-precision-training/, 2022.