<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Audio Bird Classification with Inception-v4 extended with Time and Time-Frequency Attention Mechanisms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antoine Sevilla</string-name>
          <email>antoine-sevilla@etud.univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Herve Glotin</string-name>
          <email>herve.glotin@univ-tln.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AMU</institution>
          ,
          <addr-line>Univ. Toulon, CNRS, ENSAM, LSIS UMR 7296, DYNI team</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We present an adaptation of the deep convolutional network Inception-v4 tailored to solving bioacoustic classification problems. Bird sound classification was treated as an image classification problem through transfer learning of Inception. Inception, the state-of-the-art in image classification, was applied, together with an attention algorithm, to multiscale time-frequency representations (images) of bird sounds. This results in an efficient pipeline that we call Soundception. Soundception scored highest on all tasks of the BirdClef2017 challenge. It reached 0.714 Mean Average Precision on the task asking for the classification of 1500 bird species. To our knowledge, Soundception is currently the most effective model for biodiversity monitoring of complex soundscapes.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Inception-v4</kwd>
        <kwd>Bird Species Classification</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>Sound Detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The main objective of our approach is to create an easy-to-use pipeline deriving an acoustic model from an image model. We want to stay within the same state-of-the-art deep learning framework [7], Inception-v4, and to transfer it to soundscape classification. Inception-v4 has been pre-trained on ImageNet; we adapt its inputs and learn a bird acoustic activity detector through the classification outputs joined to an attention mechanism. We then run a transfer learning from the ImageNet pre-trained weights to bird classification. The paper describes our methodology to efficiently build this model, which we call 'Soundception', in a few weeks with reduced GPU resources. We show that Soundception gives the best scores of the BirdClef 2017 challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In the last section we discuss perspectives to increase the accuracy of Soundception.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Audio featuring</title>
      <sec id="sec-2-1">
        <title>Audio to tri-channel time-frequency image</title>
        <p>
          The data representation is a crucial step in any learning process. In our approach, the representation must be scalable and based on image processing, while including some acoustic specificities, because we take advantage of Google Inception, which is fed with an RGB image. Therefore, we generate three log-spectrograms by fast Fourier transform at three scales: window sizes w_i = 2^(2i) × 128 for i ∈ {0, 1, 2} (i.e. 128, 512, 2048). This fast computation approximates a compressed multi-scale representation of the voices and chirps of birds. We think it improves on usual spectral representations, which trade off temporal against frequency resolution (see usual representations in [
          <xref ref-type="bibr" rid="ref3 ref4">4, 3</xref>
          ]). Next, we reshape the three spectrograms, by bilinear interpolation, into an optimal dimension for the Inception inputs. In sum, our audio featuring is (a sketch in code follows the list):
1. Resample the dataset to a 22050 Hz sampling rate.
2. Let min duration be the accepted minimum duration of an audio sample.
3. Let min subduration be the accepted minimum duration of a subimage.
4. Let d(x) be the duration of the audio sample x.
5. While d(x) &lt; min duration, self-concatenate x.
6. Compute three log-spectrograms Si(x) with window sizes wi ∈ {128, 512, 2048}.
7. Remove outliers of the Si(x) distribution to avoid quantization error.
8. Resize (bilinear interpolation) Si(x) to an optimal format for Inception: height = 299 pixels for the frequency dimension, 299 × min subduration / 1.5 pixels for the time dimension.
9. Concatenate the three Si(x) into one 3-channel RGB multiscale image I.
        </p>
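        <p>As an illustration, here is a minimal sketch of steps 1 to 9 under our own assumptions; it is not the authors' code. It assumes librosa and scikit-image, the helper name multiscale_image is hypothetical, and the width is scaled with the full file duration at 1.5 s per 299 pixels:</p>
        <preformat>
import numpy as np
import librosa
from skimage.transform import resize

def multiscale_image(path, sr=22050, min_duration=60.0):
    """Tri-channel featuring: three log-spectrograms at window
    sizes 128/512/2048, resized and stacked as an RGB image."""
    y, _ = librosa.load(path, sr=sr)            # step 1: resample to 22050 Hz
    while len(y) / sr &lt; min_duration:           # step 5: self-concatenate
        y = np.concatenate([y, y])
    duration = len(y) / sr
    width = int(299 * duration / 1.5)           # 1.5 s per 299 pixels
    channels = []
    for n_fft in (128, 512, 2048):              # step 6: w_i = 2^(2i) * 128
        S = librosa.amplitude_to_db(np.abs(
            librosa.stft(y, n_fft=n_fft, hop_length=n_fft // 2)))
        lo, hi = np.percentile(S, [2.5, 97.5])  # step 7: clip outliers
        S = np.clip(S, lo, hi)
        S = resize(S, (299, width), order=1)    # step 8: bilinear resize
        channels.append(S)
    return np.stack(channels, axis=-1)          # step 9: 3-channel image I
        </preformat>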
      </sec>
      <sec id="sec-2-2">
        <title>Data augmentation</title>
        <p>During the training stage, we run data augmentation, using standard transformations from computer vision. More precisely, we run the Inception preprocessing on the spectrograms Si as random hue, contrast, brightness and saturation changes, plus a random crop in time and frequency, as follows (a sketch in code follows the list):
1. Randomly choose an image I in the dataset.
2. Randomly crop a subimage Ic from I:
- let hIc = initial height of Ic = 299,
- let dIc = initial duration of Ic = 15 sec. = 299 × 10 pixels,
- set a random temporal dilatation factor of Ic uniformly sampled in [0.95, 1.05],
- set a random top of Ic uniformly sampled in [0.96, 1] × hIc,
- set a random bottom of Ic uniformly sampled in [0, 0.01] × hIc,
- crop and resize Ic from I with the above parameters and a random time offset.
3. Vision preprocessing of Ic by hue, contrast, brightness and saturation variations.
4. Add random noise or process local brightness on Ic.</p>
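        <p>The following minimal sketch, in TensorFlow, shows one way such an augmentation could be realized; it is our own simplified reading (the top/bottom cuts are folded into a single random crop), and the function name augment is hypothetical:</p>
        <preformat>
import tensorflow as tf

def augment(image):
    """Random time crop with slight temporal dilatation, then the
    standard Inception colour perturbations."""
    # Temporal dilatation in [0.95, 1.05]: crop a window of dilated
    # width, then resize back to 15 s = 2990 px (1.5 s per 299 px).
    dilatation = tf.random.uniform([], 0.95, 1.05)
    width = tf.cast(2990.0 * dilatation, tf.int32)
    # Cropping 296 of 299 rows gives a small random top/bottom offset.
    crop = tf.image.random_crop(image, size=tf.stack([296, width, 3]))
    crop = tf.image.resize(crop, [299, 2990])
    # Inception-style colour jitter on the spectrogram "image".
    crop = tf.image.random_hue(crop, max_delta=0.05)
    crop = tf.image.random_contrast(crop, 0.9, 1.1)
    crop = tf.image.random_brightness(crop, max_delta=0.1)
    crop = tf.image.random_saturation(crop, 0.9, 1.1)
    return crop
        </preformat>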
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Model specification</title>
      <sec id="sec-3-1">
        <title>Transfer learning from Inception-v4 to Soundception</title>
        <p>
          Inception-v4 is the state-of-the-art in computer vision [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. There are several ways to adapt Inception-v4 to time-frequency analysis, such as a simple average pooling over the time axis, a recurrent layer at the top of the network, or both. Here we adapt Inception-v4 to make it entirely convolutional in the time domain, the aim being to make it invariant to temporal translation and to allow images of arbitrary width. Secondly, we add time and time-frequency attention mechanisms into the branches, as represented in the synopsis of Inception-v3 in Fig. 3.
        </p>
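        <p>A minimal sketch of this fully convolutional adaptation, under our own assumptions: we use the Keras InceptionV3 backbone as a stand-in, since Inception-v4 does not ship with tf.keras. Pooling over the frequency axis only leaves the time axis free, so any image width is accepted:</p>
        <preformat>
import tensorflow as tf

# Backbone with a fixed 299-pixel frequency axis and a free time axis.
backbone = tf.keras.applications.InceptionV3(
    include_top=False, weights="imagenet",
    input_shape=(299, None, 3))
features = backbone.output                            # (batch, freq, time, ch)
x = tf.keras.layers.Lambda(
    lambda f: tf.reduce_mean(f, axis=1))(features)    # pool frequency only
x = tf.keras.layers.Conv1D(1500, 1)(x)                # per-time-step class scores
logits = tf.keras.layers.GlobalAveragePooling1D()(x)  # time-translation invariant
model = tf.keras.Model(backbone.input, logits)
        </preformat>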
        <p>We set dIc = 15 seconds, min duration = 60 sec. and scale = 1.5 sec. for 299 pixels, in line with the baseline Inception inputs of 299 × 299 pixels and the available VRAM per GPU (12 GB). We do not use a specific bird detector, in order to avoid handcrafted detection, which could weaken the complete pipeline. The detection is processed by an attention mechanism, as presented in the next section. It runs on a large time window (dIc = 15 sec.) to increase the probability of bird activity in the image. The main process results in one time-frequency RGB image (as in Fig. 1) per audio file, thus 36492 images in BirdClef 2017.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Attention mechanisms in time and time-frequency</title>
        <p>
          Attention mechanisms are gaining popularity. The goal is to focus attention somewhere or on something. We can learn detection from classification by adding a soft attention mechanism [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Thus, we add two attention mechanisms to the Inception model: a temporal attention in the auxiliary branch, and a time-frequency attention in the main branch.
        </p>
        <p>Each attention mechanism is an element-wise product of the detector feature map with the feature maps of the previous Inception layers. These mechanisms learn how to pass the information through and thus play the role of bird activity detectors.</p>
        <p>The first attention mechanism is the temporal attention, defined by a sigmoid activation, because bird temporal activities are expected to be binomial in time.</p>
        <p>The second attention mechanism is defined by the softmax of the outputs, yielding a smooth time-frequency distribution of neuron activity.</p>
        <p>Next, based on the sigmoid or softmax of the outputs, we compute the element-wise product between the feature maps of the signal and the feature maps of the detector.</p>
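        <p>The following sketch shows our reading of these two gates; it is not the authors' code, and the 1×1 convolution producing the detector map is an assumption:</p>
        <preformat>
import tensorflow as tf

def temporal_attention(features):
    """Sigmoid gate over time; features are (batch, freq, time, ch)."""
    score = tf.keras.layers.Conv2D(1, 1)(features)        # detector map
    score = tf.reduce_mean(score, axis=1, keepdims=True)  # per-time score
    gate = tf.sigmoid(score)          # binomial-in-time bird activity
    return features * gate            # element-wise product (broadcast)

def time_frequency_attention(features):
    """Softmax gate over the whole time-frequency plane."""
    score = tf.keras.layers.Conv2D(1, 1)(features)
    shape = tf.shape(score)
    flat = tf.reshape(score, [shape[0], -1])
    weights = tf.reshape(tf.nn.softmax(flat), shape)  # smooth distribution
    return features * weights
        </preformat>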
        <p>
          In Fig. 4 we show how Soundception focuses in time on bird activities. In Fig. 5 we show that, in time-frequency, Soundception indeed focuses on the loudest formant/frequency of the bird call.
The training of the model took several days, depending on the hyper-parameters, using randomly split training (90%) and validation (10%) sets as in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. For the transfer learning, we first train only the top layer of Soundception, then we fine-tune all layers. We split training into different stages with different batch sizes, according to the available GPU memory:
1. Train the model with a time window of dIc = 15 sec.,
2. Train the last layers and detectors with a mini-batch size of 8,
3. Fine-tune all layers with a mini-batch size of 4.
        </p>
        <p>There are different options to evaluate the prediction on the development set. We could consider the audio files transformed into images I, as previously described, with arbitrary temporal size. Here, we optimize the model according to the average score of the predictions from the subimages Ic of the main images I, as sketched below.</p>
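        <p>A minimal sketch of this averaging, under our own assumptions (the helper name predict_file and the hop size are hypothetical):</p>
        <preformat>
import numpy as np

def predict_file(model, image, width=2990, hop=1495):
    """Average class scores over the sub-images Ic of a file image I."""
    scores = []
    for start in range(0, max(1, image.shape[1] - width + 1), hop):
        crop = image[:, start:start + width, :]           # one Ic
        scores.append(model.predict(crop[None, ...])[0])  # per-crop scores
    return np.mean(scores, axis=0)                        # averaged prediction
        </preformat>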
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>We report in Fig. 6 the official scores of each challenger on the four tasks. Our model Soundception wins the challenge in all four tasks. The DYNI UTLN RUN1 depicted in this paper is the best model in three of the tasks, with 0.714 Mean Average Precision (MAP) on the 1500 species 'traditional records' task, 0.616 MAP on the 'background species' task, and 0.288 MAP on the 'Soundscapes with time-codes' task. It is third in the 'Soundscape without time-codes' task, for which our other run (DYNI UTLN Run2), which explored different parameters, is first.</p>
      <p>These results are good despite the fact that we did not have time to completely train Soundception on different topologies and to develop the associated preprocessings within the four weeks of the challenge.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>In this paper we show how we transferred Inception-v4 from the image to the acoustic domain, and how it learns bird sound detection by itself using attention models. The results show that it is possible to reach state-of-the-art sound classification by the transfer learning of an efficient pre-existing image classification model. This strategy can be useful to tackle other challenges without pre-segmentation.</p>
      <p>We were not able to let the training of Soundception fully converge, due to its huge computation and GPU needs; nevertheless, it reaches the best results in the BirdClef 2017 challenge. There is a lot to be done in this area. Our current work explores different scalable optimizations to learn audio-to-image representations instead of pseudo multi-scale FFT spectrograms. We are also developing a model with stacked GRUs at the top of the network.</p>
      <p>Acknowledgements. We thank Laura Bessone for her support in this paper. We thank XenoCanto, the LifeClef team, EADM GDR CNRS MADICS, SABIOD.org and Amazon Explorama Lodges with Lucio Pando, P. Bucur and M. Trone for the co-organization of this challenge. This research is also supported by STIC-AmSud BRILAAM for South American bioacoustics. We thank TPM, CG83 and UTLN for their support in the Captile project on soundscape analysis. We thank V. Roger for setting up the GPUs, and S. Paris for lending them. We are grateful to the anonymous reviewers. Antoine Sevilla set up the experiments and ideas in this article.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , S. Ioffe, V. Vanhoucke:
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          ,
          <source>arXiv:1602.07261</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>K.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          ,
          <source>ICML</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H.</given-names>
            <surname>Goeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Glotin</surname>
          </string-name>
          , WP. Vellinga,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <source>LifeCLEF Bird Identification Task</source>
          <year>2016</year>
          :
          <article-title>The arrival of Deep learning</article-title>
          ,
          <source>CLEF</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. E. Sprengel, M. Jaggi, Y. Kilcher, T. Hofmann:
          <article-title>Audio based bird species identification using deep learning techniques</article-title>
          ,
          <source>CLEF</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Joly</surname>
          </string-name>
          , H. Goeau, H. Glotin,
          <string-name>
            <given-names>C.</given-names>
            <surname>Spampinato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bonnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-P.</given-names>
            <surname>Vellinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Lombardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Planque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Palazzo</surname>
          </string-name>
          , H. Muller:
          <article-title>LifeCLEF 2017 Lab Overview: multimedia species identification challenges</article-title>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Xeno</given-names>
            <surname>Canto</surname>
          </string-name>
          <article-title>Foundation: Sharing bird sounds from around the world</article-title>
          , www.xenocanto.org (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al.: TensorFlow: Large-scale machine learning on heterogeneous systems, tensorflow.org (2015)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Train your own image classifier with Inception in TensorFlow, Google Research Blog, https://research.googleblog.com/2016/03/train-your-own-image-classifier-with.html (2016)</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>