<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Inception-v3 Based Method of LifeCLEF 2019 Bird Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jisheng Bai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bolun Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chen Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jianfeng Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhong-Hua Fu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Northwestern Polytechnical University</institution>
          ,
          <addr-line>Xi'an</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present a bird recognition method based on Inception-v3. The goal of the LifeCLEF 2019 Bird Recognition task is to detect and classify 659 bird species within the provided soundscape recordings. Log-Mel spectrograms are extracted as features, and Inception-v3 is used for bird sound detection. Several data augmentation techniques are applied to improve the robustness and generalization of the model. Finally, we evaluated our system on the BirdCLEF test data and achieved a classification mean average precision (c-mAP) of 0.055.</p>
      </abstract>
      <kwd-group>
        <kwd>Bird sound classification</kwd>
        <kwd>Inception-v3</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Deep learning has been shown to outperform traditional methods in bird sound
classification [<xref ref-type="bibr" rid="ref5">5</xref>]. Convolutional neural network (CNN) architectures perform well on
many computer vision tasks, and image-based architectures such as Inception-v4
have been reported to obtain the best performance in sound classification as well,
whatever the targeted domain [<xref ref-type="bibr" rid="ref6">6</xref>].
      </p>
      <p>
        The training data of BirdCLEF 2019 [<xref ref-type="bibr" rid="ref7">7</xref>], a subtask of LifeCLEF 2019
[<xref ref-type="bibr" rid="ref2">2</xref>], contains about 50,000 recordings taken from xeno-canto.org and covers 659
common species from North and South America [<xref ref-type="bibr" rid="ref1">1</xref>]. Each species is represented by
more than 15 and up to 100 recordings. The validation split contains 77
recordings. The recordings vary in quality, sampling rate and encoding. Each recording
includes metadata such as location, latitude and longitude.
      </p>
      <p>To recognize 659 species and train on this many recordings, we use
Inception networks instead of shallow CNN architectures. As input features, we selected
log-Mel spectrograms. Data augmentation methods are applied during
preprocessing.</p>
      <p>We use TensorFlow to train the model and the Python library librosa to compute
the features.</p>
    </sec>
    <sec id="sec-2">
      <title>Data preparation</title>
      <sec id="sec-2-1">
        <title>Audio processing</title>
        <p>
          To separate bird sound from background noise, we follow the method
presented in [<xref ref-type="bibr" rid="ref12">12</xref>] and used in [<xref ref-type="bibr" rid="ref9">9</xref>] and [<xref ref-type="bibr" rid="ref3">3</xref>]: all
recordings are divided into 659 bird species classes and one common noise class. The details
are as follows:
        </p>
        <list list-type="bullet">
          <list-item><p>Every recording is read at a sample rate of 44100 Hz.</p></list-item>
          <list-item><p>A short-time Fourier transform (STFT) is used to calculate the
spectrogram with a window length of 512 and a hop length of 256.</p></list-item>
          <list-item><p>We calculate the median of each row and each column, then set every element
of the spectrogram to 1 if it is more than three times larger than the median of its
row and of its column, and to 0 otherwise.</p></list-item>
          <list-item><p>We then apply binary erosion and dilation to distinguish the noise and signal
parts. The filter is a 4 by 4 square.</p></list-item>
          <list-item><p>We create a one-dimensional indicator vector whose i-th
element is set to 1 if the corresponding column contains at least one 1, and to 0 otherwise.</p></list-item>
          <list-item><p>Finally, we smooth the indicator vector twice with a dilation filter of size 4 by
1 and use it as a mask to split the original bird recording. Every recording
can be divided into many signal and noise parts; all signal parts are
concatenated into one signal, and all noise parts into one noise track.</p></list-item>
        </list>
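        <p>The thresholding and smoothing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes NumPy and SciPy are available, takes an already computed magnitude spectrogram as input, and the function name <monospace>signal_indicator</monospace> and its parameters are hypothetical.</p>

```python
import numpy as np
from scipy import ndimage


def signal_indicator(spec, threshold=3.0, filter_size=4):
    """Mark the spectrogram columns (time frames) that contain bird signal.

    spec: 2-D magnitude spectrogram of shape (n_freq, n_frames).
    Returns a boolean vector of length n_frames.
    """
    row_median = np.median(spec, axis=1, keepdims=True)
    col_median = np.median(spec, axis=0, keepdims=True)
    # Keep a cell only if it exceeds three times its row median
    # and three times its column median.
    mask = np.logical_and(spec > threshold * row_median,
                          spec > threshold * col_median)
    # Erosion followed by dilation with a 4 by 4 square removes isolated cells.
    square = np.ones((filter_size, filter_size), dtype=bool)
    mask = ndimage.binary_erosion(mask, structure=square)
    mask = ndimage.binary_dilation(mask, structure=square)
    # A column counts as signal if at least one of its cells survived.
    indicator = mask.any(axis=0)
    # Smooth the indicator vector twice with a 4 by 1 dilation filter.
    line = np.ones(filter_size, dtype=bool)
    for _ in range(2):
        indicator = ndimage.binary_dilation(indicator, structure=line)
    return indicator
```

        <p>To use the result as an audio mask, each frame index would still have to be mapped back to sample positions through the hop length (256 in the settings above).</p>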
        <p>We cut the recordings of every species into 5-second parts, because the
model is trained on 5-second segments and also predicts the validation and test data in 5-second segments. After all these steps,
we obtain 659 folders, one per species, containing the 5-second recordings, and one noise folder
containing all the noise parts.</p>
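        <p>A minimal sketch of the 5-second cutting step is given below. The helper name <monospace>split_into_chunks</monospace> is hypothetical, and dropping a trailing remainder shorter than 5 seconds is an assumption, since the text does not say how the remainder is handled.</p>

```python
def split_into_chunks(samples, sample_rate, seconds=5.0):
    """Cut a 1-D sequence of audio samples into consecutive fixed-length chunks.

    Assumption: a trailing remainder shorter than one chunk is dropped.
    """
    chunk_len = int(round(seconds * sample_rate))
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]
```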
      </sec>
      <sec id="sec-2-2">
        <title>Data augmentation</title>
        <p>
          Data augmentation techniques have been widely applied in the results of the last few years. All
recordings are resampled to 22050 Hz and then filtered by a high-pass filter. Then
time and frequency augmentation methods similar to those used in [<xref ref-type="bibr" rid="ref9">9</xref>] and [<xref ref-type="bibr" rid="ref12">12</xref>] are
applied, as follows:
        </p>
        <list list-type="bullet">
          <list-item><p>Read a bird sound file from a random position (wrapping around to the beginning when
the end is reached).</p></list-item>
          <list-item><p>Add at most four noise files on top of a bird sound file, each with an independent
chance of 0.5. Each noise file is multiplied by a dampening factor between 0 and 0.5.
(In [<xref ref-type="bibr" rid="ref9">9</xref>], the greatest impact on identification performance was
gained by adding background noise. Many systems also use noise overlay as
one of their data augmentation methods to improve performance.)</p></list-item>
          <list-item><p>Use the STFT to generate a spectrogram from the sound file with a window size
of 1024 and a hop length of 512.</p></list-item>
          <list-item><p>Apply normalization and a logarithm to calculate a log-Mel spectrogram
with 256 Mel bands; frequencies above 10500 Hz and below 200 Hz are
removed.</p></list-item>
          <list-item><p>Because of the Inception input size, duplicate the grayscale spectrogram
to all three channels. Different interpolation filters are applied to resize
the spectrogram.</p></list-item>
          <list-item><p>Finally, resize the spectrograms to 299×299×3 to fit the input size of
Inception.</p></list-item>
        </list>
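        <p>The first two augmentation steps (random start position with wrap-around, and noise overlay) can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, signals are NumPy arrays at the same sample rate, noise clips are repeated or truncated to the signal length, and the caller is assumed to pass randomly chosen noise candidates.</p>

```python
import numpy as np


def random_wraparound(samples, rng):
    """Start reading at a random position and wrap around to the beginning."""
    start = rng.integers(0, len(samples))
    return np.roll(samples, -start)


def overlay_noise(signal, noises, rng, max_noises=4, chance=0.5, max_damp=0.5):
    """Add up to `max_noises` noise clips, each with independent probability."""
    mixed = np.asarray(signal, dtype=np.float64).copy()
    for noise in noises[:max_noises]:
        # Each candidate noise clip is used with an independent chance of 0.5.
        if rng.random() >= chance:
            continue
        # Dampening factor drawn uniformly between 0 and 0.5.
        damp = rng.uniform(0.0, max_damp)
        # np.resize repeats (or truncates) the noise to match the signal length.
        mixed += damp * np.resize(noise, len(mixed))
    return mixed
```

        <p>A seeded generator such as <monospace>np.random.default_rng(0)</monospace> makes the augmentation reproducible, which helps when comparing differently augmented training folders.</p>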
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Network architecture</title>
      <sec id="sec-3-1">
        <title>Transfer learning from Inception-v3</title>
        <p>
          Inception-v3 is one of the state-of-the-art architectures in the image classification
challenge [<xref ref-type="bibr" rid="ref13">13</xref>]. It has also been confirmed that Inception-based convolutional neural
networks on Mel spectrograms provide the best performance [<xref ref-type="bibr" rid="ref4">4</xref>]. The best network
for bird song detection seems to be the Inception-v3 architecture, which performs
better than even more recent architectures [<xref ref-type="bibr" rid="ref11">11</xref>]. We therefore selected Inception-v3
as our base model.
        </p>
        <p>
          The Inception models were fine-tuned from networks pre-trained on the Large
Scale Visual Recognition Challenge (ILSVRC) [<xref ref-type="bibr" rid="ref10">10</xref>] version of ImageNet, a dataset
with almost 1.5 million photographs of 1000 object categories scraped from the
web. As mentioned in [<xref ref-type="bibr" rid="ref8">8</xref>], starting training from pre-trained weights
makes training faster and gives better performance. However, training only the last
classification layers leads to worse results, and re-training the whole network does not
reach the best performance either.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Training strategy</title>
        <p>During training, categorical cross-entropy was used as the loss function and
stochastic gradient descent as the optimizer, with a Nesterov momentum of 0.9, a weight
decay of 1e-4 and a constant learning rate of 0.01.</p>
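        <p>To make these hyperparameters concrete, the sketch below writes out the loss and one parameter update in NumPy. It uses one common formulation of Nesterov momentum and treats weight decay as an L2 term added to the gradient; both are assumptions, and the framework used for training may differ in detail. The function names are hypothetical.</p>

```python
import numpy as np


def categorical_cross_entropy(probs, one_hot):
    """Cross-entropy between predicted class probabilities and a one-hot label."""
    clipped = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return float(-np.sum(one_hot * np.log(clipped)))


def nesterov_sgd_step(weights, velocity, grad,
                      lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with Nesterov momentum and L2 weight decay."""
    # Assumption: weight decay is folded into the gradient as an L2 penalty.
    g = grad + weight_decay * weights
    velocity = momentum * velocity - lr * g
    # Nesterov look-ahead: apply the momentum term on top of the plain step.
    weights = weights + momentum * velocity - lr * g
    return weights, velocity
```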
        <p>We generated 20 different folders of training data, each augmented
with different parameters. We trained on these folders in random order
for 50 epochs with a batch size of 72.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The evaluation metric is the classification mean Average Precision (c-mAP),
which considers each class c of the ground truth as a query. This means that for
each class c, all predictions with ClassId = c are extracted from the run file and ranked
by decreasing probability, and the average precision is computed for that class.
The metric can be expressed as</p>
      <disp-formula id="eq-1">
        <label>(1)</label>
        <tex-math>\mathrm{c\text{-}mAP} = \frac{\sum_{c=1}^{C} \mathrm{AveP}(c)}{C}</tex-math>
      </disp-formula>
      <p>where C is the number of species in the ground truth and AveP(c) is the average
precision for a given species c, computed as</p>
      <disp-formula id="eq-2">
        <label>(2)</label>
        <tex-math>\mathrm{AveP}(c) = \frac{\sum_{k=1}^{n} P(k) \times \mathrm{rel}(k)}{n_{\mathrm{rel}}(c)}</tex-math>
      </disp-formula>
      <p>where k is the rank of an item in the list of the predicted segments containing c,
n is the total number of predicted segments containing c, P(k) is the precision
at cut-off k in the list, rel(k) is an indicator function equaling 1 if the segment
at rank k is a relevant one (i.e. is labeled as containing c in the ground truth)
and n_rel(c) is the total number of relevant segments for c.</p>
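      <p>As a worked illustration of the metric, the sketch below computes AveP(c) from a ranked list of relevance flags and averages it over classes. It is not the official evaluation script; the function names and the input layout (one ranked 0/1 list plus the number of relevant segments per class) are assumptions made for the example.</p>

```python
def average_precision(ranked_relevance, n_relevant):
    """AveP for one class.

    ranked_relevance: 0/1 flags of the predicted segments for this class,
        already sorted by decreasing probability.
    n_relevant: number of relevant segments in the ground truth (must be positive).
    """
    hits = 0
    total = 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k) * rel(k), with P(k) = hits / k
    return total / n_relevant


def c_map(per_class):
    """Mean of AveP over all classes.

    per_class: dict mapping class id to (ranked_relevance, n_relevant).
    """
    scores = [average_precision(flags, n) for flags, n in per_class.values()]
    return sum(scores) / len(scores)
```

      <p>For example, a class whose ranked predictions are relevant at ranks 1 and 3, with two relevant segments in total, gets AveP = (1/1 + 2/3) / 2.</p>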
      <p>On the validation dataset, selecting the top 100 probabilities gave a
c-mAP score of 0.088 and an r-mAP (retrieval mean Average Precision) of 0.176,
while selecting the top 5 probabilities gave 0.068 and 0.156, respectively.</p>
      <list list-type="bullet">
        <list-item><p>result0: Due to limited time, we regrettably submitted only one run.
We predicted all the test data and selected the top 5 probabilities per 5-second segment
as our final and only submission. We ranked 3rd among the
teams, with a c-mAP score of 0.055 and an r-mAP of 0.145. Details are shown
in Table 1.</p></list-item>
        <list-item><p>result1: We submitted another run after the deadline, which achieved a c-mAP
of 0.065 and an r-mAP of 0.164. This run contains the top 100 probabilities per
5-second period (Table 2).</p></list-item>
      </list>
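      <p>The "top 5" and "top 100" selections can be illustrated with a small helper that keeps the k most probable species for one 5-second segment. The name <monospace>top_k_species</monospace> and the list-of-probabilities input are hypothetical; this is only a sketch of the selection step.</p>

```python
def top_k_species(probabilities, k):
    """Return (class_index, probability) pairs for the k largest probabilities,
    ordered by decreasing probability."""
    ranked = sorted(enumerate(probabilities),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```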
    </sec>
    <sec id="sec-5">
      <title>Conclusion and future work</title>
      <p>We presented a bird recognition system based on the Inception model with several data augmentation
techniques and obtained a final c-mAP score of 0.055. Evaluating the top 100 probabilities
instead of the top 5 probabilities per 5-second segment improved the c-mAP score by 0.01.
To handle more than 50,000 recordings,
we selected Inception-v3, which has fewer parameters and a stronger feature extraction
ability. During training, data augmentation methods were applied to prevent
overfitting and improve generalization performance.</p>
      <p>Due to limited time, we could not submit more results or compare the
influence of different parameters and architectures. An ensemble of networks could
significantly improve the results, and we plan to apply one next year. We will also investigate
the performance of CRNNs and capsule networks for bird recognition. Features
can also have a great impact on performance, and data
augmentation methods specific to bird species detection should be tested. There is still a lot
of room for improvement in our future work.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgement</title>
      <p>We would like to thank Stefan Kahl, Hervé Goëau, Alexis Joly, Hervé Glotin
and Willem-Pier Vellinga for organizing this task. We especially thank
Mario Lasseck for sharing his knowledge, and Northwestern
Polytechnical University CIAIC for providing computing resources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. CrowdAI Homepage (
          <year>2019</year>
          ), https://www.crowdai.org/challenges/lifeclef-2019-bird-recognition
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          , Hervé Goëau,
          <string-name>
            <surname>C.B.S.K.M.S.H.G.P.B.W.P.V.R.P.F.R.S.H.M.</surname>
          </string-name>
          <article-title>: Overview of LifeCLEF 2019: Identification of Amazonian plants, South &amp; North American birds, and niche prediction</article-title>
          .
          <source>In: Proceedings of CLEF</source>
          <year>2019</year>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Fazekas</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schindler</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lidy</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A multi-modal deep neural network approach to bird-song identification</article-title>
          . arXiv preprint arXiv:1811.04448 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Goëau</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , H.G.,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of BirdCLEF 2018: monophone vs. soundscape bird identification</article-title>
          .
          <source>CLEF working notes</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Botella</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of LifeCLEF 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI</article-title>
          . In:
          <source>International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . pp.
          <fpage>247</fpage>
          –
          <lpage>266</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lombardo</surname>
            ,
            <given-names>J.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palazzo</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>LifeCLEF 2017 lab overview: multimedia species identification challenges</article-title>
          .
          <source>In: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . pp.
          <fpage>255</fpage>
          –
          <lpage>274</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kahl</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stöter</surname>
            ,
            <given-names>F.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Planque</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of BirdCLEF 2019: Large-scale bird recognition in soundscapes</article-title>
          .
          <source>In: CLEF working notes 2019</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Acoustic bird detection with deep convolutional neural networks</article-title>
          .
          <source>In: Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018)</source>
          . pp.
          <fpage>143</fpage>
          –
          <lpage>147</lpage>
          (November
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lasseck</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Audio-based bird species identification with deep convolutional neural networks</article-title>
          .
          <source>Working Notes of CLEF</source>
          <year>2018</year>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Russakovsky</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karpathy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khosla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , et al.:
          <article-title>ImageNet large scale visual recognition challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>115</volume>
          (
          <issue>3</issue>
          ),
          <fpage>211</fpage>
          –
          <lpage>252</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sevilla</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Audio bird classification with Inception-v4 extended with time and time-frequency attention mechanisms</article-title>
          .
          <source>In: Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland, September 11-14,
          <year>2017</year>
          . (
          <year>2017</year>
          ), http://ceur-ws.org/Vol-1866/paper_177.pdf
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Sprengel</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jaggi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kilcher</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hofmann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Audio based bird species identification using deep learning techniques</article-title>
          . Tech. rep. (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , Ioffe, S.,
          <string-name>
            <surname>Shlens</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wojna</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Rethinking the inception architecture for computer vision</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>2818</fpage>
          –
          <lpage>2826</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>