<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TUC Media Computing at BirdCLEF 2021: Noise augmentation strategies in bird sound classification in combination with DenseNets and ResNets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arunodhayan Sampathkumar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danny Kowerko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universität Chemnitz</institution>
          ,
          <addr-line>Str. der Nationen 62, 09111 Chemnitz</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research paper presents deep learning techniques for bird recognition to classify 397 species in the BirdCLEF 2021 challenge. The proposed method was inspired by the DCASE2019 audio tagging challenge, which classifies and recognizes different sound events. Data augmentation methods such as noise augmentation and spectrogram augmentation are used to avoid overfitting and hence generalize the model. The final solution is based on an ensemble of different backbone models and on splitting the dataset based on the geographic locations provided in the test set. Furthermore, framewise post-processing of the predictions is used to identify the bird events. The best results were obtained from a 12-model ensemble with a public and private score of 0.6487 and 0.6034, respectively.</p>
      </abstract>
      <kwd-group>
        <kwd>CNN (Convolutional Neural Network)</kwd>
        <kwd>Bird Recognition</kwd>
        <kwd>Deep learning</kwd>
        <kwd>Data Augmentation</kwd>
        <kwd>Soundscapes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The paper is organized as follows: Section 2 describes the dataset used for bird recognition, including data preparation. Section 3 describes feature extraction, data augmentation, the neural network architecture, and the training steps. Section 4 presents the evaluation results, followed by the conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>All the recordings are first converted from ogg to wav format with a sampling rate of 32 kHz.
The soundscape recordings are prepared for validation by cutting them into 5 s chunks according
to the annotations. The background noises are separated from the soundscape recordings based
on parts without bird activity using the provided metadata. Later, these background noises are
used for data augmentation.</p>
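      <p>As an illustration of this preparation step, the following is a minimal sketch of the conversion and chunking; the file paths, naming scheme, and use of librosa/soundfile are assumptions for illustration, not the original pipeline code.</p>
      <preformat>
# Minimal sketch: convert an ogg recording to 32 kHz wav and cut it into 5 s chunks.
import librosa
import soundfile as sf

SR = 32_000        # sampling rate used throughout the paper
CHUNK = 5 * SR     # 5 s validation chunk length in samples

def convert_and_chunk(ogg_path, out_prefix):
    # Load the ogg file resampled to 32 kHz mono.
    y, _ = librosa.load(ogg_path, sr=SR, mono=True)
    # Write consecutive 5 s wav chunks for validation.
    for k, start in enumerate(range(0, len(y) - CHUNK + 1, CHUNK)):
        sf.write(f"{out_prefix}_{k:03d}.wav", y[start:start + CHUNK], SR)
</preformat>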
      <p>
        Some parts of the validation soundscape recordings are merged with the Xeno-canto training set [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
for training, while the rest of the recordings are used for cross-validation. The training set and
validation set are split using 5 stratified folds.
      </p>
      <p>To create more diverse models, 6 different sub-datasets are formed targeting different locations, which are later ensembled.</p>
      <p>As presented in Table 1, the sub-datasets are divided based on locations; e.g. Dataset-1 and Dataset-2 are prepared based on the locations of the test set: around the given latitude and longitude of the test data, a radius of 200 km and 400 km, respectively, is marked, and the species most likely to occur within that radius are taken into account (a distance-filtering sketch follows below).</p>
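      <p>A minimal sketch of the radius-based species filtering is given below; the haversine distance is used, and the metadata column names (latitude, longitude, primary_label) are assumptions for illustration.</p>
      <preformat>
# Keep only species whose recordings fall within a given radius of a test site.
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance in kilometres.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

def species_in_radius(meta, site_lat, site_lon, radius_km=200):
    # meta: DataFrame of Xeno-canto recordings with coordinates and labels.
    d = haversine_km(meta["latitude"], meta["longitude"], site_lat, site_lon)
    return sorted(meta.loc[d &lt;= radius_km, "primary_label"].unique())
</preformat>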
      <p>Dataset-3 consists of species that mainly occur at the given test recording locations. Dataset-4 contains species mainly present at recording locations in the United States, Dataset-5 species mainly present at recording locations in Costa Rica, and Dataset-6 species mainly present at recording locations in Colombia.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Spectrogram Extraction</title>
        <p>The recordings are sampled at a 32 kHz sample rate and trimmed to 30 s long chunks: a shorter window might not include any sound event, or might only include a noisy event or a background species, as shown in Figure 1; for this reason, longer chunks are preferred. To make the model learn correctly, each label needs to correspond to call events of the respective species. First, we compute a Short-Time Fourier Transform (STFT) with a Hann window of 1024 samples and a hop size of 384 samples, retain only the magnitude, and then apply log-mel filter banks with 64 mel bins from 150 Hz to 15 kHz.</p>
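        <p>A minimal sketch of this extraction with librosa, using the parameters stated above (Hann window of 1024 samples, hop size of 384 samples, 64 mel bins, 150 Hz to 15 kHz):</p>
        <preformat>
# Log-mel spectrogram extraction with the stated STFT/mel parameters.
import librosa
import numpy as np

def logmel(y, sr=32_000):
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=1024, hop_length=384, window="hann",
        n_mels=64, fmin=150, fmax=15_000, power=2.0)
    return librosa.power_to_db(mel)   # shape: (64 mel bins, frames)
</preformat>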
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Data Augmentation</title>
        <p>
          Different data augmentation techniques are performed to increase the model performance and improve its generalization to real-world data. The following data augmentation methods are applied to the raw audio recordings:
• 30 s chunk at a random position for training
• Gaussian noise
• Gaussian Signal-to-Noise Ratio (SNR)
• Adding primary background noise
• Adding secondary background noise
• Mixup augmentation
• Spectrogram augmentation
Primary background noise: The train and test recordings exhibit a domain shift. To make the training more robust, different noises are incorporated as background noise. Besides the noise extracted from the soundscape recordings, recordings without bird activity from BAD (Bird Audio Detection) are used [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Apart from these two noise sources, generated pink noise is used [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Secondary background noise: various (bursts of overlapping) short audio snippets are mixed into the train recording with random pauses in between, as sketched below. Noises like wind, car sounds, insects, rain, and thunder are used.</p>
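        <p>A minimal sketch of mixing a background-noise snippet into a training chunk at a random SNR; the SNR range is an assumption for illustration:</p>
        <preformat>
# Mix a noise snippet into a signal at a randomly drawn signal-to-noise ratio.
import numpy as np

def add_noise_snr(signal, noise, snr_db_range=(3.0, 30.0), rng=np.random):
    snr_db = rng.uniform(*snr_db_range)
    # Tile or crop the noise so it covers the whole training chunk.
    reps = int(np.ceil(len(signal) / len(noise)))
    noise = np.tile(noise, reps)[:len(signal)]
    # Scale the noise so that the power ratio matches the sampled SNR (dB).
    p_sig = np.mean(signal ** 2) + 1e-12
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return signal + scale * noise
</preformat>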
        <p>
          Mixup: Audio chunks from random files are mixed together, and their corresponding labels are added, as shown in Figure 2. The mixup augmentations are constructed using the formulae [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]:
x̃ = λ · x₁ + (1 − λ) · x₂ (1)
ỹ = λ · y₁ + (1 − λ) · y₂ (2)
where (x₁, y₁) and (x₂, y₂) are the two randomly selected recordings for mixup, and λ is the mix ratio with values from [0, 1]. Mixup increases the robustness of the model and generalizes well to real-world data, because soundscape data typically contain more than one species occurring in the event window.
        </p>
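        <p>A minimal sketch of waveform mixup following Eqs. (1) and (2), drawing λ uniformly from [0, 1]:</p>
        <preformat>
# Mix two waveforms and their multi-hot label vectors.
import numpy as np

def mixup(x1, y1, x2, y2, rng=np.random):
    lam = rng.uniform(0.0, 1.0)        # mix ratio λ from [0, 1]
    x = lam * x1 + (1.0 - lam) * x2    # Eq. (1): mixed waveform
    y = lam * y1 + (1.0 - lam) * y2    # Eq. (2): mixed label vector
    return x, y
</preformat>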
        <p>Gaussian SNR: Gaussian noise is applied to the samples with a random Signal-to-Noise Ratio (SNR).</p>
        <p>
          Spectrogram augmentation: Time stretching and pitch shifting are the augmentations tried on spectrograms. Time stretching changes the speed/duration of a sound without affecting its pitch; it takes the wave samples and a stretch factor, here 0.4, which yields only a small difference from the original sample. Pitch shifting changes the pitch of a sound without affecting its speed; it takes the wave samples, the sample rate, and the number of steps (−4 to +4) by which the pitch must be shifted. These methods are performed using the LIBROSA library [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ].
        </p>
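        <p>A minimal sketch of both augmentations with librosa; the keyword arguments follow recent librosa versions:</p>
        <preformat>
# Time stretching and pitch shifting on the waveform via librosa.
import librosa

def stretch_and_shift(y, sr=32_000, rate=0.4, n_steps=4):
    stretched = librosa.effects.time_stretch(y, rate=rate)            # speed changes, pitch kept
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)  # pitch changes, speed kept
    return stretched, shifted
</preformat>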
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Network Architecture</title>
        <p>
          In recent years, Convolutional Neural Networks (CNNs) have been successfully used for audio recognition and detection. The architecture design for the bird recognition task was inspired by the DCASE2019 PANNs (Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. PANNs are developed based on a cross-talk CNN with an extra fully connected layer added to the penultimate layer of the CNN.
        </p>
        <p>
          In previous BirdCLEF challenges, deeper CNNs performed better than wider or shallower CNNs. Hence, the backbone networks used for this challenge are ResNets
[
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and DenseNets [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>ResNet: Deeper CNNs perform well on audio recognition tasks, but in very deep CNNs the gradients do not propagate properly. To solve this issue, ResNets introduce shortcut connections between convolutional layers.</p>
        <p>DenseNets: DenseNets were designed to improve the information flow between layers: a different connectivity pattern was introduced, with direct connections from any layer to all subsequent layers. Down-sampling of the feature maps is facilitated by dividing the network into multiple densely connected blocks, making the network deeper.</p>
        <p>In this task, after log-mel feature extraction, the inputs are passed to ResNets/DenseNets whose last fully connected layers are removed so that only features are extracted. Then, a modified 1D attention-based fully connected layer is attached to the backbone. The output of this network is a dictionary which contains clipwise and framewise outputs. Table 3 illustrates the modified networks used in this research.</p>
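        <p>Below is a hedged sketch of such a network: an ImageNet-pretrained backbone with the classifier removed and a PANNs-style 1D attention head returning clipwise and framewise outputs. The DenseNet121 backbone and layer sizes are illustrative choices, not necessarily the exact configurations of Table 3.</p>
        <preformat>
# Backbone feature extractor + 1D attention head with clipwise/framewise outputs.
import torch
import torch.nn as nn
import torchvision.models as models

class AttHead(nn.Module):
    def __init__(self, in_ch, n_classes):
        super().__init__()
        self.att = nn.Conv1d(in_ch, n_classes, kernel_size=1)
        self.cla = nn.Conv1d(in_ch, n_classes, kernel_size=1)

    def forward(self, x):                    # x: (batch, channels, time)
        att = torch.softmax(torch.tanh(self.att(x)), dim=-1)
        cla = torch.sigmoid(self.cla(x))     # framewise probabilities
        clip = torch.sum(att * cla, dim=-1)  # attention-pooled clipwise output
        return {"clipwise_output": clip, "framewise_output": cla}

class BirdNet(nn.Module):
    def __init__(self, n_classes=397):
        super().__init__()
        backbone = models.densenet121(pretrained=True)
        self.features = backbone.features    # drop the final fully connected layer
        self.head = AttHead(1024, n_classes) # 1024 feature channels in DenseNet121

    def forward(self, spec):                 # spec: (batch, 3, mel, time)
        x = self.features(spec)              # (batch, 1024, mel', time')
        x = torch.mean(x, dim=2)             # pool the frequency axis
        return self.head(x)
</preformat>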
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Training setup</title>
        <p>
          Our CNNs used a model pretrained on ImageNet [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and were fine-tuned with training data
previously converted to log-mel scaled spectrogram images. Machine learning functionality
was implemented using the PyTorch library, while audio (pre-)processing functionality like
spectrogram decomposition was realized using the Librosa library.
        </p>
        <p>
          The networks are trained for 75 epochs without mixup augmentation and 150 epochs with
mixup augmentation. The loss functions used here are the BCE-focal-2way loss (binary cross entropy) and the SED-scaled-pos-neg-focal loss (FL) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>SED-Scaled-Pos-Neg-Focal loss: it focuses on the loss of primary labels and secondary labels.</p>
        <p>FL = FocalLoss(ŷ, y) (3)
FL_scaled_neg = (1 − σ(ŷ)) · (1 − y) · FL (4)
FL_scaled_pos = σ(ŷ) · (1 − y) · FL (5)
FL_scaled = (4) + (5) (6)
mask = where(y &gt; 0.0, ones_like(y), zeros_like(y)) (7)
FL_final = mask · FL_scaled (8)</p>
        <p>Here ŷ denotes the predicted logits, y the target labels, and σ the sigmoid function; ones_like are tensors filled with the scalar value '1' and zeros_like are tensors filled with the scalar value '0'.</p>
        <p>The grouped output losses are the scaled focal loss, the BCE loss, and the focal loss.</p>
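        <p>A hedged PyTorch sketch along the lines of Eqs. (3)-(8); the focal loss follows [11], and the exact weighting is a reconstruction for illustration rather than the authors' verbatim code:</p>
        <preformat>
# SED-scaled-pos-neg-focal loss, reconstructed from Eqs. (3)-(8).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                  # probability assigned to the true label
    return (1.0 - p_t) ** gamma * bce      # Eq. (3), elementwise

def sed_scaled_pos_neg_focal(logits, targets, gamma=2.0):
    fl = focal_loss(logits, targets, gamma)
    probs = torch.sigmoid(logits)
    fl_neg = (1.0 - probs) * (1.0 - targets) * fl   # Eq. (4)
    fl_pos = probs * (1.0 - targets) * fl           # Eq. (5)
    fl_scaled = fl_neg + fl_pos                     # Eq. (6)
    mask = torch.where(targets &gt; 0.0,
                       torch.ones_like(targets),
                       torch.zeros_like(targets))   # Eq. (7)
    return (mask * fl_scaled).mean()                # Eq. (8)
</preformat>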
        <p>The optimizer used here is AdamW with a weight decay of 0.1. The learning rate scheduler is a combination of a Cosine Annealing scheduler with warmup (cycle size: epoch length × number of epochs) and a LinearCyclicalScheduler (cycle size: epoch length × 2). The initial learning rate is 0.001. Background species metadata are not taken into account.</p>
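        <p>A minimal sketch of this training configuration in PyTorch; plain cosine annealing stands in for the warmup and LinearCyclicalScheduler combination, and the number of steps per epoch is illustrative:</p>
        <preformat>
# AdamW with weight decay 0.1, initial learning rate 0.001, cosine annealing.
import torch

model = BirdNet(n_classes=397)   # from the architecture sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)
epochs, steps_per_epoch = 75, 1000   # 75 epochs without mixup; steps are illustrative
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=epochs * steps_per_epoch)   # cycle size: epoch length x epochs
</preformat>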
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation results</title>
      <p>This section illustrates the combination of models, the ensemble techniques, and the evaluation scores on the test set. Table 4, Table 5, and Table 6 illustrate the different strategies used and their respective results on the public and private leaderboard. The ensemble method used here is voting, as sketched below.</p>
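      <p>A minimal sketch of the voting scheme; the threshold values are placeholders for the per-RUN settings discussed in Tables 7-10:</p>
      <preformat>
# A species is emitted for a 5 s window when at least `min_votes` models
# predict it above both the clipwise and the framewise threshold.
import numpy as np

def vote(preds, clip_thr=0.3, frame_thr=0.3, min_votes=4):
    # preds: list of (clipwise, framewise) pairs, one per model;
    # clipwise: (n_classes,), framewise: (n_frames, n_classes).
    votes = np.zeros(preds[0][0].shape[0], dtype=int)
    for clip, frame in preds:
        hit = (clip &gt;= clip_thr) &amp; (frame.max(axis=0) &gt;= frame_thr)
        votes += hit.astype(int)
    return np.flatnonzero(votes &gt;= min_votes)   # indices of predicted species
</preformat>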
      <p>Ensemble RUN 1: Models M1, M2, M9, M10, M14, and M15 are used. These models contain 397 classes and use different background noises. The clipwise threshold, framewise threshold, and number of votes are given in Table 7.</p>
      <p>Ensemble RUN 2: Models M3, M4, M5, M11, M12, M13, M16, M17, and M18 are used. These models contain 391, 345, and 273 classes and use different background noises. The clipwise threshold, framewise threshold, and number of votes are given in Table 8. This 9-model ensemble combines different class sets split based on location and achieved our top public score of 0.6741 with fewer false positives, and 0.6024 on the private score.</p>
      <p>Ensemble RUN 3: Models M7, M8, and M9 are used. These models contain 162, 187, and 263 classes and use different background noises. The clipwise threshold, framewise threshold, and number of votes are given in Table 9. This ensemble comprises 3 different locations; the 3-model ensemble based on the location split achieves a score of 0.6799 on the public leaderboard with fewer false positives, compared to 0.5951 on the private leaderboard.</p>
      <p>Ensemble RUN 4: Models M2, M3, M4, M6, M7, M8, M10, M11, M12, M15, M16, and M17 are used. This RUN takes the best performing models, which use different background noises. The clipwise threshold, framewise threshold, and number of votes are given in Table 10. This ensemble comprises 12 different models with different backbones and different class sets split based on location, and yields the best private score of 0.6034.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>The current approach attained an F1 score of 0.6034 on the private leaderboard. Recognizing all bird species is still challenging because of the domain shift between train (clean audio) and test (noisy audio) data. The train dataset consists of weakly labeled recordings (clipwise labeling), and many background species were present. A multi-label annotation of the train files could have significantly improved the models for bird recognition.</p>
      <p>There are several techniques to improve this bird recognition task, such as vision transformers and the removal of parts without bird activity from the train dataset. A promising approach would be feature extraction that merges two different features in combination with polyphonic event detection. Better inference techniques could focus more on locations, e.g. using the eBird API and separate thresholds for each species, to achieve better recognition of bird events.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Mühling, J. Franz, N. Korfhage, B. Freisleben, Bird species recognition via neural architecture search, in: L. Cappellato, C. Eickhoff, N. Ferro, A. Névéol (Eds.), Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Thessaloniki, Greece, September 22-25, 2020, volume 2696 of CEUR Workshop Proceedings, CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2696/paper_188.pdf.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Joly, H. Goëau, S. Kahl, L. Picek, T. Lorieul, E. Cole, B. Deneu, M. Servajean, R. Ruiz De Castañeda, I. Bolon, H. Glotin, R. Planqué, W.-P. Vellinga, A. Dorso, H. Klinck, T. Denton, I. Eggel, P. Bonnet, H. Müller, Overview of LifeCLEF 2021: a system-oriented evaluation of automated species identification and species distribution prediction, in: Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021), 2021.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] S. Kahl, T. Denton, H. Klinck, H. Glotin, H. Goëau, W.-P. Vellinga, R. Planqué, A. Joly, Overview of BirdCLEF 2021: Bird call identification in soundscape recordings, in: Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum, 2021.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] V. Lostanlen, J. Salamon, A. Farnsworth, S. Kelling, J. P. Bello, BirdVox-full-night: A dataset and benchmark for avian flight call detection, in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018, IEEE, 2018, pp. 266-270. URL: https://doi.org/10.1109/ICASSP.2018.8461410. doi:10.1109/ICASSP.2018.8461410.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. O. Smith, Spectral Audio Signal Processing, http://ccrma.stanford.edu/jos/sasp/, accessed &lt;date&gt;. Online book, 2011 edition.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] K. Xu, D. Feng, H. Mi, B. Zhu, D. Wang, L. Zhang, H. Cai, S. Liu, Mixup-based acoustic scene classification using multi-channel convolutional neural network, in: R. Hong, W. Cheng, T. Yamasaki, M. Wang, C. Ngo (Eds.), Advances in Multimedia Information Processing - PCM 2018 - 19th Pacific-Rim Conference on Multimedia, Hefei, China, September 21-22, 2018, Proceedings, Part III, volume 11166 of Lecture Notes in Computer Science, Springer, 2018, pp. 14-23. URL: https://doi.org/10.1007/978-3-030-00764-5_2. doi:10.1007/978-3-030-00764-5_2.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, M. D. Plumbley, PANNs: Large-scale pretrained audio neural networks for audio pattern recognition, CoRR abs/1912.10211 (2019). URL: http://arxiv.org/abs/1912.10211. arXiv:1912.10211.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, O. Nieto, librosa: Audio and music signal analysis in python, in: Proceedings of the 14th python in science conference, volume 8, 2015.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Mueller, R. Manmatha, M. Li, A. J. Smola, ResNeSt: Split-attention networks, CoRR abs/2004.08955 (2020). URL: https://arxiv.org/abs/2004.08955. arXiv:2004.08955.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Huang, Z. Liu, K. Q. Weinberger, Densely connected convolutional networks, CoRR abs/1608.06993 (2016). URL: http://arxiv.org/abs/1608.06993. arXiv:1608.06993.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, CoRR abs/1708.02002 (2017). URL: http://arxiv.org/abs/1708.02002. arXiv:1708.02002.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>