<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Data Science Techniques for Datasets on Mental and Neurodegenerative Disorders, June</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A benchmarking study of deep learning techniques applied for breath analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vincenzo Dentamaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Giglio</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Impedovo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luigi A. Moretti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pirlo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sblendorio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomedicine and Prevention, University of Rome Tor Vergata</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bari Aldo Moro</institution>
          ,
          <addr-line>Via Orabona 4, 70121, Bari</addr-line>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of the West of England (UWE) - Coldharbour Ln</institution>
          ,
          <addr-line>Stoke Gifford, Bristol BS16 1QY</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>22</volume>
      <issue>2023</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>In machine learning, new architectures are continually proposed, making it difficult to evaluate which configurations best fit specific fields and tasks. The most reliable way to overcome this issue is to test them on the same data with the same parameters. In this work, five state-of-the-art deep neural network architectures have been benchmarked in a promising field of health technology: breath analysis. In particular, it is reported that standard convolutional neural networks exploiting inductive bias do not perform as well as the AUCO ResNet, an architecture designed for audio classification. In addition, the Vision Transformer model needs large amounts of data to learn patterns, showing the limitation of this technique even when transfer learning is performed.</p>
      </abstract>
      <kwd-group>
        <kwd>Breath analysis</kwd>
        <kwd>transfer learning</kwd>
        <kwd>AUCO ResNet</kwd>
        <kwd>Mel Spectrogram</kwd>
        <kwd>benchmark</kwd>
        <kwd>ViT</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Three respiratory diseases are entrenched in the top 10 causes of death in the world. Chronic
obstructive pulmonary disease (COPD) alone kills 3.2 million people every year [1]. The COVID-19
pandemic has reminded us how vulnerable we are as a community. However, it has also been an
opportunity to demonstrate the potential of digital solutions as a fundamental support for the
healthcare system. Improving early detection is considered an essential step to reduce the burden of
respiratory diseases [1]. However, accessible, affordable, and reliable tools must be designed for
behavioral biometric analysis [2]. Machine Learning (ML) and smart sensors can be valuable resources
in this matter, but the research must consider the practical implementations, to avoid wasting time
and resources developing solutions whose performance is not reproducible in the clinical
environment [3]. Benchmarks are of paramount importance when it comes to testing different models
under the same conditions. This allows a fair comparison of their accuracies and provides insights
into the strengths and weaknesses of each model. This study aims to provide an overview of five
architectures applied to a database of breath sounds. Most of them were originally developed for
computer vision tasks, thus a filter converting sounds into images (i.e., spectrograms) has been
applied. Two different tasks have been performed, to test the models in both binary and multiclass
classification. The main research question is to understand which model is best suited to perform
audio breath analysis when networks are trained in a transfer learning fashion.</p>
      <p>The work is organized as follows: Section II introduces the state-of-the-art review. Section III
(Material) describes the datasets used for training and transfer learning as well as the pre-processing
applied. Section IV (Methods) presents the architectures used as well as the experimental setup.
Section V sketches the results. Section VI discusses the results, while Section VII contains the
conclusions and future research directions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. State of the art review</title>
      <p>The ICBHI 2017 database has been used by various authors under different conditions. Almost all
authors since 2019 have been using deep neural network architectures, which have shown promising
results. This brief literature review is focused on deep neural network architectures already used on
the ICBHI 2017 dataset. In particular, in [4] the authors fine-tuned a pre-trained ResNet architecture
on the ICBHI 2017 and multi-channel lung sound datasets to perform binary healthy/unhealthy
classification, reaching an F1-score of 87.59%. The authors in [5] addressed the limited dataset size
using supervised contrastive learning. This technique relies on respiration cycle annotations as well as
spectrogram frequency and temporal masking in order to generate augmented samples for
representation learning with a contrastive loss. The accuracy reached is 0.759.</p>
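      <p>A minimal sketch of this kind of frequency and temporal masking is given below. It illustrates the augmentation technique described in [5], not the authors' exact implementation; the mask widths are illustrative assumptions.</p>
      <preformat>
import numpy as np

def mask_spectrogram(spec, max_freq_width=8, max_time_width=20, rng=None):
    """Zero out one random frequency band and one random time span
    (SpecAugment-style masking used to build augmented views)."""
    rng = rng or np.random.default_rng()
    aug = spec.copy()                     # spec: (n_mels, n_frames)
    f0 = rng.integers(0, spec.shape[0] - max_freq_width)
    t0 = rng.integers(0, spec.shape[1] - max_time_width)
    aug[f0:f0 + rng.integers(1, max_freq_width), :] = 0.0   # frequency mask
    aug[:, t0:t0 + rng.integers(1, max_time_width)] = 0.0   # time mask
    return aug
      </preformat>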
      <p>The Adventitious Respiratory Sound Classification Network (ARSC-Net) [6] is a network designed
for accurate respiratory sound classification. It combines residual blocks with channel-spatial
attention to extract and classify two types of features from adventitious respiratory sounds:
Mel-Frequency Cepstral Coefficients (MFCCs) and Mel spectrograms. The two types of features are
processed in parallel through encoder paths with residual attention to obtain a feature
representation, which is then merged in a channel-spatial attention module to focus adaptively on
important features in both the channel and spatial domains. The channel-spatial attention improves
the feature representation by exploring inter-channel relationships in the spectra using channel
attention and generating inter-spatial correlation mapping through serial spatial attention. The
accuracy reached is 80% in binary classification (healthy/unhealthy), but no inter-subject separation
scheme was used. In [7] the authors proposed a new way to enhance the classification of respiratory
sounds by performing data augmentation. The approach involves transforming and shifting the input
data. The accuracy reached is about 0.704.</p>
      <p>The LungBRN [8] and its evolution, the LungRN+NL [9], have been designed especially for the
ICBHI 2017 challenge. The LungBRN architecture is an advanced bi-ResNet deep learning architecture
that utilizes STFT and wavelet feature extraction to enhance accuracy, while the LungRN+NL
incorporates a non-local block in the ResNet architecture. In addition, to address the imbalance
problem, the authors also added data augmentation to increase the accuracies. The LungBRN and
LungRN+NL achieved accuracies of 0.692 and 0.632, respectively.</p>
      <p>As this brief literature review shows, the majority of works are based on the ResNet deep learning
architecture, trained from scratch or in a transfer learning fashion. In addition, some works used data
augmentation. As pointed out in [10], data augmentation has caused some confusion in identifying
correct patterns for certain classes when the initial data is not representative. Additionally, data
augmentation decreases reproducibility and raises ethical concerns in the medical field [11].</p>
      <p>For these reasons, it has been decided to benchmark the most well-known deep neural network
architectures in computer vision, as well as novel architectures such as the Vision Transformer and
the AUCO ResNet, all trained on exactly the same data and under exactly the same conditions.
Additionally, the methods used in this work do not make use of any data augmentation technique,
increasing the reproducibility of the results.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Material</title>
      <sec id="sec-3-1">
        <title>3.1. Respiratory Sound Dataset</title>
        <p>Compiled in 2017 at the International Conference on Biomedical Health Informatics (ICBHI 2017),
the Respiratory Sound Dataset [1] was built to provide reliable data for comparing different automatic
audio analysis algorithms. The database consists of a total of 5.5 hours of recordings collected in 920
annotated audio samples from 126 subjects ranging from young to old ages. 6898 respiratory cycles
are included, of which 1864 contain crackles, 886 wheezes, and 506 both crackles and wheezes. The
recordings were collected using heterogeneous equipment and their duration ranged from 7.85 s to
86.2 s. The data include clean breathing sounds and noisy recordings that simulate real-life
conditions.</p>
        <p>Unfortunately, this database is not free from limitations. Not only have some subjects been used
to collect several audio samples (up to 60, with 6.5% of the entire sample size belonging to a single
subject), but the file samples are also strongly unbalanced, both in the distribution of the seven
labelled diseases and between the number of healthy (no disease) and unhealthy subjects, which are
respectively 35 (3.8%) and 885 (96.2%) (the separation used for the Binary Task). These issues, shown
in Figure 1, have been taken into account and addressed as discussed later in the present work. Given
the lack of data, instead of considering each disease individually, it has been preferred to group all
diseases into non-chronic diseases (i.e. Lower Respiratory Tract Infections (LRTI), Upper Respiratory
Tract Infections (URTI), Pneumonia, and Bronchiolitis, for a total of 75 samples) and chronic diseases
(i.e. COPD, Bronchiectasis and Asthma, including the remaining 810 samples), in addition to the third
group of healthy subjects (used for the Multiclass Task), as sketched below. Acute (i.e. temporary)
diseases sometimes (e.g. when under- or mis-treated) evolve into chronic ones, thus any tool in
support of medical decision-making is highly welcome.</p>
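        <p>In code, this grouping reduces to a simple label mapping. The diagnosis strings below are assumptions chosen for illustration, not necessarily the exact labels used in the dataset files.</p>
        <preformat>
# Hypothetical label names; the grouping follows the text above.
NON_CHRONIC = {"LRTI", "URTI", "Pneumonia", "Bronchiolitis"}   # 75 samples
CHRONIC = {"COPD", "Bronchiectasis", "Asthma"}                 # 810 samples

def multiclass_label(diagnosis):
    """Map a per-sample diagnosis to the three-class target."""
    if diagnosis == "Healthy":
        return "healthy"
    if diagnosis in CHRONIC:
        return "chronic"
    if diagnosis in NON_CHRONIC:
        return "non_chronic"
    raise ValueError("Unknown diagnosis: " + diagnosis)
        </preformat>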
      </sec>
      <sec id="sec-3-2">
        <title>3.2. UrbanSound8K Dataset</title>
        <p>The UrbanSound8K Dataset [2] is one of the largest and most widely used of its kind. It collects
8732 audio tracks, up to 4 seconds in length each, recorded in urban environments and labelled into
10 fairly well-balanced classes. The files are stored in .wav format, as in the Respiratory Sound
Database, and organized in 10 folds for easier comparison of results between different ML models.
Given its considerable size, this dataset was used in the transfer learning procedure, as explained
later.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Preprocessing on Respiratory Sound Dataset</title>
        <p>Since 94% of the Respiratory Sound Dataset was recorded at 44,100 Hz, it was chosen to resample
all audio tracks at that sampling rate. Moreover, since the audio tracks have different durations, as
testified by the maximum and average lengths of 86.2 and 21.5 seconds respectively, each audio file
has been divided into its respiratory cycles, avoiding cutting all audios to a fixed length with a massive
loss of data.</p>
        <p>As is known from signal processing theory, some chunks of audio yield more information and are
therefore more relevant for training the algorithms. Inspired by the conclusions of a study focused on
a similar task, but with a different respiratory disease (COVID-19) [4], it has been supposed that, also
in this scenario, relevant information may be present in the transition from one respiratory cycle to
another. For this reason, each respiratory cycle has been segmented to ensure a certain data overlap
(1 s) with the next one (Figure 2). Some adjustments were needed when dealing with cycles shorter
than 1 s; in those cases the gap was filled by adding enough zeros to compensate (padding
technique).</p>
        <p>Given the average length of 3 s per respiratory cycle, the cycles longer than 4 s (1 s overlap
included) have been resized to be 4 s long, while the shorter ones have been extended using padding.
After this procedure, each respiratory cycle corresponds to a (4 × 44,100 =) 176,400-dimensional
vector of amplitudes.</p>
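        <p>A minimal sketch of this segmentation step is shown below, assuming per-cycle start/end sample annotations are available. Extending each cycle by 1 s into the following one is one plausible reading of the overlap described above.</p>
        <preformat>
import numpy as np

SR = 44100            # sampling rate (Hz)
WIN = 4 * SR          # 4 s window = 176,400 samples
OVERLAP = 1 * SR      # 1 s of overlap with the next cycle

def to_fixed_length(segment, target=WIN):
    """Trim segments longer than 4 s; zero-pad shorter ones."""
    if len(segment) >= target:
        return segment[:target]
    return np.pad(segment, (0, target - len(segment)))

def segment_recording(audio, cycle_bounds):
    """cycle_bounds: (start, end) sample indices of each respiratory cycle.
    Each segment keeps 1 s of the following cycle, as described above."""
    return np.stack([to_fixed_length(audio[start:min(end + OVERLAP, len(audio))])
                     for start, end in cycle_bounds])
        </preformat>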
        <p>To reduce the chance of biased results, a user-based dataset splitting was performed, which
ensures that each subject, from which the data have been collected, belongs to either the training or
the testing set only [3], as will be detailed later.</p>
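        <p>With scikit-learn, such a split can be sketched as follows; X, y and subject_ids are placeholder arrays standing in for the segments, their labels and the subject identifier of each segment.</p>
        <preformat>
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Placeholder data: 10 segments from 5 subjects (2 segments each).
X = np.zeros((10, 176400))
y = np.zeros(10)
subject_ids = np.repeat(np.arange(5), 2)

# Grouping by subject guarantees no subject appears in both sets.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))
        </preformat>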
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methods</title>
      <p>Two classification tasks have been performed in this work to compare different architectures. The
first one aimed to recognize healthy vs. unhealthy subjects, while the second one is a multi-class
classification test to recognize the specific condition (healthy, unhealthy with a chronic disease, or
unhealthy with an acute disease). The tested architectures are:</p>
      <p>AUCO ResNet [4]: the Auditory Cortex ResNet is a biologically inspired deep neural network
especially designed for sound classification. It is built on the intuition that mammals have evolved
sound perception in order to focus on certain frequencies better than others, some of which are not
audible to the human ear even with a phonendoscope. This intuition is encoded by the presence of
three attention mechanisms, namely the squeeze-and-excitation mechanism [5], the convolutional
block attention module [6], and the novel sinusoidal learnable attention [4]. This last attention
mechanism acts by merging relevant information from activation maps at various levels of the
network, acting similarly to the biological pyramidal-like neuronal cells that neuroscientists have
reported to code for high-level concepts. AUCO ResNet takes raw audio as input and outputs the
respective class, without pre-processing, data augmentation or manual spectrogram generation. The
model includes elements also present in the biological auditory cortex of mammals (rats): it is
composed of six main blocks, it can evolve its sound perception because the mel spectrogram layer is
also trainable, it has several attention levels, and the number of neurons within each stage follows
proportions similar to those found in rats, with similar functionalities.</p>
      <p>DenseNet-201 [17], with transfer learning pre-training and a non-trainable Mel Spectrogram filter.
In the DenseNet architecture, the input of one layer is concatenated with the feature maps of all
previous layers. This procedure allows for rich feature propagation and gradient flow, reducing the
vanishing gradient problem faced in deep networks. This leads to a reduction in the number of
parameters in the network and improved computational efficiency. The key features of the DenseNet
architecture are dense blocks (groups of layers where each layer is connected to every other layer in
the same block), transition layers (used to reduce the spatial size of the feature maps and prevent
overfitting), and a global average pooling layer at the end of the network to generate the output
predictions.</p>
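      <p>A minimal Keras sketch of the dense-connectivity idea follows; the layer count and filter sizes are illustrative and do not reproduce the DenseNet-201 configuration.</p>
      <preformat>
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    """Each layer receives the concatenation of all previous feature maps."""
    for _ in range(num_layers):
        h = layers.BatchNormalization()(x)
        h = layers.ReLU()(h)
        h = layers.Conv2D(growth_rate, 3, padding="same")(h)
        x = layers.Concatenate()([x, h])   # dense connectivity
    return x
      </preformat>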
      <p>ResNet50 [18], with transfer learning pre-training and a non-trainable Mel Spectrogram filter.</p>
      <p>This architecture is designed to face the problem of training very deep networks, where the
accuracy degrades with increasing depth due to the vanishing gradients problem. ResNet solves this
problem by introducing residual connections, in which the input is directly added to the output of
each layer. The residual connections allow for the effective propagation of gradients, even for very
deep networks, and are used as a technique to improve accuracy while reducing overfitting. ResNet,
together with DenseNet and many other architectures, has been shown to generalize well [19]. The
architecture of ResNet consists of multiple residual blocks, where each block contains multiple
convolutional layers and may include additional operations such as batch normalization as well as
attention mechanisms [5]. The residual connections are implemented by summing the output of each
block with its input, before passing the result to the next block.</p>
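      <p>A minimal Keras sketch of such a residual block is given below; the filter count is illustrative, and the identity shortcut assumes the input already has the same number of channels.</p>
      <preformat>
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Sum the block input with its output (identity shortcut)."""
    h = layers.Conv2D(filters, 3, padding="same")(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.Conv2D(filters, 3, padding="same")(h)
    h = layers.BatchNormalization()(h)
    return layers.ReLU()(layers.Add()([x, h]))   # residual connection
      </preformat>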
      <p>InceptionResNet-V2 [20], with transfer learning pre-training and a non-trainable Mel Spectrogram
filter. This architecture is designed to perform multiple parallel convolutional filters of different sizes
and pooling operations to capture information at multiple scales in the input image. It resembles the
multi-scale learning process in computer vision, thus allowing the network to learn features at
different levels of detail, improving its overall representation and accuracy. The architecture is a
sequence of multiple Inception blocks, each of which contains multiple convolutional and pooling
operations with filters applied in parallel, followed by a concatenation of the results. This allows the
network to learn a wide range of features and is computationally efficient, as it reduces the number
of parameters in the network.</p>
      <p>Vision Transformer (ViT) [21].</p>
      <p>The Vision Transformer architecture is based on the transformer architecture [22], which uses
self-attention mechanisms (multi-head attention) to learn the patterns between input elements in a
sequence. In the case of the Vision Transformer, the input elements are image patches (16x16 pixels),
and the self-attention mechanism learns the relationships between patches. This process is different
from the one found in convolutional neural networks, which exploit the inductive bias and produce
convolutional filters. Self-attention is a mechanism which allows the model to attend to different
parts of its input and learn the relationships between the elements in the input. The ViT architecture
consists of multiple stacked transformer blocks, each of which contains a self-attention mechanism
and a fully connected layer used to generate the output predictions.</p>
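      <p>A minimal Keras sketch of one such transformer encoder block follows; the head count and layer widths are illustrative, and a fixed embedding dimension is assumed.</p>
      <preformat>
from tensorflow.keras import layers

def vit_block(x, num_heads=8, key_dim=64, mlp_dim=2048):
    """Multi-head self-attention over the sequence of patch embeddings,
    followed by a fully connected (MLP) sub-layer, each with a residual."""
    h = layers.LayerNormalization()(x)
    h = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)(h, h)
    x = layers.Add()([x, h])                 # residual around attention
    h = layers.LayerNormalization()(x)
    h = layers.Dense(mlp_dim, activation="gelu")(h)
    h = layers.Dense(x.shape[-1])(h)         # back to the embedding size
    return layers.Add()([x, h])              # residual around the MLP
      </preformat>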
      <sec id="sec-4-1">
        <title>4.1. Mel Spectrogram filter</title>
        <p>Some of the listed architectures (i.e. DenseNet-201, ResNet50 and InceptionResNet-V2), already
implemented in the Keras API, are designed for image classification tasks. Therefore, it is necessary
to implement the Mel Spectrogram filter, to convert audio files into images of their spectrograms.
This filter first extracts the information about the audio frequencies, computing the short-time
Fourier transform (STFT) and its magnitude, and then the features from each audio signal. Thus, it
computes and applies the matrix of a triangular Mel filter bank, which imitates the perception of the
human ear. This choice was guided by the idea of making the results from the algorithms as
explainable as possible for a hypothetical clinical implementation. In AUCO ResNet the Mel
Spectrogram is built in as a trainable layer of the network, and it learns the most discriminating
frequencies for each class of the dataset during training [4].</p>
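        <p>A non-trainable version of this filter can be sketched with librosa; n_fft, hop_length and n_mels below are illustrative assumptions, not necessarily the values used in the experiments.</p>
        <preformat>
import librosa
import numpy as np

def mel_spectrogram_image(path, sr=44100, n_mels=128):
    """STFT magnitude followed by a triangular mel filter bank,
    returned on a dB scale as an image-like 2D array."""
    audio, _ = librosa.load(path, sr=sr)
    spec = librosa.feature.melspectrogram(y=audio, sr=sr, n_fft=2048,
                                          hop_length=512, n_mels=n_mels)
    return librosa.power_to_db(spec, ref=np.max)
        </preformat>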
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Pre-training with UrbanSound8K</title>
        <p>All the listed architectures have been pre-trained on the UrbanSound8K Dataset, using 9 of the
10 folds as a training set and the remaining one as a test set. For AUCO ResNet, the first 250 layers
were frozen (apart from the Mel Spectrogram layer), the last 150 layers were trainable, and a final
dense layer with a softmax activation function was added for the classification. For DenseNet,
InceptionResNet-V2, ResNet50 and ViT, the last layer was substituted with a new dense layer with a
softmax activation function to perform the final classification, while the entire network was allowed
to be trained. The training hyperparameters are reported in Table I.</p>
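        <p>A sketch of this head replacement for one of the Keras models follows; the input shape is a placeholder for the spectrogram image, and the weights initialization is left out since the study pre-trains on UrbanSound8K rather than ImageNet.</p>
        <preformat>
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import DenseNet201

num_classes = 10   # UrbanSound8K classes

# Backbone without its original classification head.
backbone = DenseNet201(weights=None, include_top=False, pooling="avg",
                       input_shape=(128, 345, 3))   # placeholder shape
outputs = layers.Dense(num_classes, activation="softmax")(backbone.output)
model = Model(backbone.input, outputs)   # entire network remains trainable
        </preformat>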
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Architectures training and finetuning</title>
        <p>When training the architectures on the Respiratory Sound Dataset, user-based repeated random
sub-sampling validation with under-sampling was used to fine-tune the hyperparameters. Repeated
random sub-sampling validation, also called Monte Carlo cross-validation, was preferred to k-fold
cross-validation. It randomly splits the training set into 80% for the actual training and 20% for the
validation at each iteration (10 in total) and finally provides the average of the metrics. Moreover,
given the aforementioned database imbalances, at each iteration an under-sampling of the most
numerous classes was performed, to expose the models to the same number of patients for each
class.</p>
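        <p>A simplified sketch of this validation scheme is given below. For brevity it balances classes at the segment level, whereas the study balances the number of patients per class; the 80/20 split is done over subjects so no subject crosses the train/validation boundary.</p>
        <preformat>
import numpy as np

def user_based_monte_carlo(subject_ids, labels, n_iter=10,
                           train_frac=0.8, seed=0):
    """Repeated random sub-sampling: split SUBJECTS 80/20 at each
    iteration, then under-sample the majority classes in training."""
    subject_ids, labels = np.asarray(subject_ids), np.asarray(labels)
    rng = np.random.default_rng(seed)
    subjects = np.unique(subject_ids)
    for _ in range(n_iter):
        perm = rng.permutation(subjects)
        train_subj = set(perm[:int(train_frac * len(perm))])
        mask = np.array([s in train_subj for s in subject_ids])
        train_idx, val_idx = np.flatnonzero(mask), np.flatnonzero(~mask)
        # under-sample every class to the size of the smallest one
        per_class = [train_idx[labels[train_idx] == c]
                     for c in np.unique(labels[train_idx])]
        n_min = min(len(idx) for idx in per_class)
        balanced = np.concatenate([rng.choice(idx, n_min, replace=False)
                                   for idx in per_class])
        yield balanced, val_idx
        </preformat>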
        <p>The labels of each target class, in both tasks, were one-hot encoded and the softmax function
was used as the activation function. The procedure was performed twice using two different loss
functions: the categorical cross-entropy and the balanced categorical cross-entropy. The latter
consists of multiplying the former by a weight computed using the number of examples for each
class. Therefore, if there is imbalance between the classes, the loss function emphasizes the samples
of the minority classes.</p>
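        <p>One common formulation of such a weighting (inverse class frequency) is sketched below as an assumption, since the exact formula is not given above; train_labels is a placeholder array.</p>
        <preformat>
import numpy as np
import tensorflow as tf

train_labels = np.array([0, 0, 0, 0, 1, 2, 2])    # placeholder labels
counts = np.bincount(train_labels)                # examples per class
class_weights = counts.sum() / (len(counts) * counts)

def balanced_categorical_crossentropy(y_true, y_pred):
    """Categorical cross-entropy scaled by the weight of the true class,
    so minority-class samples contribute more to the loss."""
    cce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    w = tf.reduce_sum(tf.constant(class_weights, tf.float32) * y_true,
                      axis=-1)                    # y_true is one-hot
    return w * cce
        </preformat>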
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Architectures testing</title>
        <p>The architectures have been tested by classifying not only single respiratory cycles, but also the
complete audio files of a specific patient in the test set, to simulate a real-world scenario. Testing the
models on single patients was done by feeding the model with all the respiratory cycles of each
patient. The class was then assigned by computing the mode over all the respiratory cycles of each
patient.</p>
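        <p>This per-patient aggregation reduces to a mode computation, which can be sketched as follows (the class labels are placeholders):</p>
        <preformat>
from collections import Counter

def patient_class(cycle_predictions):
    """Most frequent predicted class among a patient's respiratory cycles."""
    return Counter(cycle_predictions).most_common(1)[0][0]

patient_class(["COPD", "Healthy", "COPD"])   # -> "COPD"
        </preformat>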
        <p>The statistical mode has been used considering the unremarkable results collected by the tested
architectures in this work. However, this approach may lead to a loss of information, especially in
patients with mild or not fully manifested conditions. Moreover, certain conditions may be spotted
only in certain kinds of breath cycles (e.g. of a specific duration range), and, again, using the mode
this information would be lost.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The collected results have been computed using the categorical cross-entropy loss function.
Results are computed per cycle in binary classification (Table I) and multi-class classification (Table III),
as well as per patient, both in binary classification (Table II) and multi-class classification (Table IV).
The highest performances are shown in bold. Table V reports the percentages of wrong predictions
per respiratory cycle duration in the binary classification task.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>In the Binary Task the AUCO ResNet performed better than the other architectures when
pre-trained on the UrbanSound8K Database. This is not a surprising result considering that this
architecture has been developed for audio analysis.</p>
      <p>In the Multiclass Task the results are poor for all the architectures tested. Again, this was an
expected result given the unbalanced database used as well as its limited size. In fact, the Vision
Transformer architecture in particular needs to be trained with a lot of data. In a similar fashion, the
AUCO ResNet also needs more data, as its several attention mechanisms are not capable of filtering
out the noise while keeping the frequencies containing more information.</p>
      <p>Analysing the wrong prediction percentages for different respiratory cycle lengths, as shown in
Table V, it becomes clear that the architectures performed better with longer cycles. This insight
could lead to new experiments in which only cycles with a certain minimum length would be
considered, even though this suggested approach needs to be evaluated from a clinical perspective,
because important information could be stored in the short respiratory cycles too.</p>
      <p>Given the large number of COPD samples in the Respiratory Sound Database, the idea of
developing a COPD detector may be considered, training the AUCO ResNet architecture on a binary
task: COPD-positive samples vs. the sum of the other classes, for differential diagnosis purposes.</p>
      <p>Moreover, to improve the performance of the AUCO ResNet architecture with transfer learning,
a more contextualised dataset may be preferred for the pre-training (e.g. the one used in the original
paper, based on breath and cough audios of COVID-19 patients).</p>
      <p>Finally, the most obvious improvement would be achieved by feeding the models with a huge and
well-balanced dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>Performing benchmarks is as important as building and releasing new architectures. It not only
allows comparing different solutions, but also better contextualises their limitations and their
potential. This work goes even further, importing models from the computer vision field to concretely
test their applicability in a totally different context. This is the essence of innovation, and it should
never be underestimated, even when the results are not that encouraging, as in this case. The hope is
to inspire other researchers to explore and test new combinations of architectures and configurations,
keeping in mind the real-world applicability of their work.</p>
      <p>AUCO ResNet has provided remarkable results in audio analysis, but new studies, based on more
balanced databases, are needed to deeply explore its potential.</p>
      <p>Once an ML architecture reaches satisfying results in this context, validated by clinical trials,
endless opportunities will be unlocked, democratising access to care and bringing tangible benefits
to people all around the globe. This is more than enough to keep investing in AI research applied to
healthcare, despite the challenges it presents compared to other, more quickly remunerative
fields.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>This work was partially supported by the project FAIR - Future AI Research (PE00000013), Spoke 6
- Symbiotic AI, under the NRRP MUR program funded by the NextGenerationEU.</p>
      <p>[10] V. Dentamaro, P. Giglio, D. Impedovo, L. Moretti, and G. Pirlo, "AUCO ResNet: an end-to-end network for Covid-19 pre-screening from cough and breath," Pattern Recognit., vol. 127, p. 108656, Jul. 2022, doi: 10.1016/j.patcog.2022.108656.</p>
      <p>[11] F. Renard, S. Guedria, N. de Palma, and N. Vuillerme, "Variability and reproducibility in deep learning for medical image segmentation," Sci. Rep., vol. 10, no. 1, p. 13724, Aug. 2020.</p>
      <p>[12] B. M. Rocha et al., "An open access database for the evaluation of respiratory sound classification algorithms," Physiol. Meas., vol. 40, no. 3, Mar. 2019, doi: 10.1088/1361-6579/ab03ea.</p>
      <p>[13] J. Salamon, C. Jacoby, and J. P. Bello, "A dataset and taxonomy for urban sound research," in Proceedings of the ACM International Conference on Multimedia - MM '14, New York, NY, USA: ACM Press, 2014.</p>
      <p>[14] G. Garcia, G. Moreira, D. Menotti, and E. Luz, "Inter-Patient ECG Heartbeat Classification with Temporal VCG Optimized by PSO," Sci. Rep., vol. 7, no. 1, Dec. 2017, doi: 10.1038/s41598-017-09837-3.</p>
      <p>[15] J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, "Squeeze-and-Excitation Networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 2011-2023, Sep. 2017, doi: 10.1109/TPAMI.2019.2913372.</p>
      <p>[16] S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, "CBAM: Convolutional Block Attention Module," in Computer Vision - ECCV 2018, Lecture Notes in Computer Science. Cham: Springer International Publishing, 2018, pp. 3-19.</p>
      <p>[17] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, Jul. 2017.</p>
      <p>[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Dec. 2015.</p>
      <p>[19] F. He, T. Liu, and D. Tao, "Why ResNet works? Residuals generalize," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 12, pp. 5349-5362, Dec. 2020.</p>
      <p>[20] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," Feb. 2016.</p>
      <p>[21] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," Oct. 2020, doi: 10.48550/arXiv.2010.11929.</p>
      <p>[22] A. Vaswani et al., "Attention Is All You Need," Adv. Neural Inf. Process. Syst., vol. 2017-December, pp. 5999-6009, Jun. 2017, Accessed: Nov. 11, 2021. [Online]. Available: https://arxiv.org/abs/1706.03762v5</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>S. M. Levine and D. D. Marciniuk, "Global Impact of Respiratory Disease: What Can We Do, Together, to Make a Difference?," Chest, vol. 161, no. 5, p. 1153, May 2022, doi: 10.1016/j.chest.2022.01.014.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>M. Chimienti, I. Danzi, V. Gattulli, D. Impedovo, G. Pirlo, and D. Veneto, "Behavioral Analysis for User Satisfaction," Proceedings - 2022 IEEE 8th International Conference on Multimedia Big Data (BigMM 2022), pp. 113-119, 2022, doi: 10.1109/BigMM55396.2022.00027.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>V. Gattulli, D. Impedovo, G. Pirlo, and G. Semeraro, "Early Dementia Identification: On the Use of Random Handwriting Strokes," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 13424 LNCS, pp. 285-300, 2022, doi: 10.1007/978-3-031-19745-1_21.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>T. Nguyen and F. Pernkopf, "Lung Sound Classification Using Co-tuning and Stochastic Normalization," 2021, Accessed: Jan. 31, 2023. [Online]. Available: https://github.com/makcedward/nlpaug; I. Moummad and N. Farrugia, "Supervised Contrastive Learning for Respiratory Sound Classification."</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>L. Xu, J. Cheng, J. Liu, H. Kuang, F. Wu, and J. Wang, "ARSC-Net: Adventitious Respiratory Sound Classification Network Using Parallel Paths with Channel-Spatial Attention," Proceedings - 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2021), pp. 1125-1130, 2021, doi: 10.1109/BIBM52615.2021.9669787.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>Z. Wang and Z. Wang, "A Domain Transfer Based Data Augmentation Method for Automated Respiratory Classification," ICASSP 2022 - IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings, vol. 2022-May, pp. 9017-9021, 2022, doi: 10.1109/ICASSP43922.2022.9746941.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>Y. Ma et al., "LungBRN: A Smart Digital Stethoscope for Detecting Respiratory Disease Using bi-ResNet Deep Learning Algorithm," BioCAS 2019 - Biomedical Circuits and Systems Conference, Proceedings, Oct. 2019, doi: 10.1109/BIOCAS.2019.8919021.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>Y. Ma, X. Xu, and Y. Li, "LungRN+NL: An Improved Adventitious Lung Sound Classification Using Non-local Block ResNet Neural Network with Mixup Data Augmentation," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, vol. 2020-October, pp. 2902-2906, 2020, doi: 10.21437/Interspeech.2020-2487.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>