<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Instance-based bird species identi cation with undiscriminant features pruning - LifeCLEF 2014</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexis Joly</string-name>
          <email>alexis.joly@inria.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Julien Champ</string-name>
          <email>julien.champ@inra.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olivier Buisson</string-name>
          <email>olivier.buisson@inra.fr</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Inra</institution>
          ,
          <addr-line>AMAP, LIRMM, Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Inria, LIRMM</institution>
          ,
          <addr-line>Montpellier</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Institut National de l'Audiovisuel (INA)</institution>
          ,
          <addr-line>Bry-sur-Marne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>625</fpage>
      <lpage>633</lpage>
      <abstract>
        <p>This paper reports the participation of Inria to the audiobased bird species identi cation challenge of LifeCLEF 2014 campaign. Inspired by recent works on ne-grained image classi cation, we introduce an instance-based classi cation scheme based on the dense indexing of MFCC features and the pruning of the non-discriminant ones. To make such strategy scalable to the 30M of MFCC features extracted from the tens of thousands audio recordings of the training set, we used highdimensional hashing techniques coupled with an e cient approximate nearest neighbors search algorithm with controlled quality. Further improvements are obtained by (i) using a sliding classi er with max pooling (ii) weighting the query features according to their semantic coherence (iii) making use of the metadata to lter incoherent species. Results show the e ectiveness of the proposed technique which ranked 3rd among the 10 participating groups.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Building accurate knowledge of the identity, the geographic distribution and the
evolution of living species is essential for a sustainable development of
humanity as well as for biodiversity conservation. In this context, using multimedia
identi cation tools is considered as one of the most promising solution to help
bridging the taxonomic, i.e. the di culty for common people to name observed
living organisms and then produce or access to useful knowledge. The LifeCLEF
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] lab proposes to evaluate this challenge in the continuity of the image-based
plant identi cation task was run within ImageCLEF the years before but with
a broader scope (considering birds and sh in addition to plants and audio and
video contents in addition to images). This paper particularly reports the
participation of Inria ZENITH research group to the audio-based bird identi cation
task. Inspired by some recent works on ne-grained image classi cation [
        <xref ref-type="bibr" rid="ref5 ref8">8, 5</xref>
        ],
we introduce an instance-based classi cation scheme based on the dense
indexing of MFCC features and the pruning of the non-discriminant ones. Section 2
describes the preliminary audio processings and features extraction steps.
Section 3 then presents our raw instance-based classi cation scheme making use of
high-dimensional hashing techniques coupled with an e cient approximate
nearest neighbors search algorithm with controlled quality. Section 4 then introduces
the additional semantic ltering and weighting rules that are used to improve the
raw scheme. Section 6 nally reports our o cial results within LifeCLEF 2014
as well as few additional experiments conducted to evaluate the contribution of
the di erent steps.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Pre-processing and features extraction</title>
      <p>The dataset used for this challenge is composed of 14,027 audio recordings
belonging to 501 bird species from Brazil area. As various recording devices are
used, and because it is di cult to capture these sounds as birds are often far
away from the recording devices, many recordings contains a lot of noise. To
overcome this problem, we used SoX, the "Swiss Army knife of sound
processing programs"4. As a rst step, we used the noisered specialised lter, to lter
out noise from the audio, and then we reduce the length of large (i.e. &gt; 0:1s)
silent passages from audio les to 0:1s. In order to obtain audio les with
ideally no more noise but still enough signal, we tried removing as much noise as
possible (using the noisered amount parameter) while guaranteeing that the
resulting audio le was at least 20% the size of the initial audio record. After this
pre-processing step, we used an open source software framework, marsyas5, to
extract MFCC features with parameters based on the provided audio features
in the Birdclef task : MFCC are computed on windows of 11.6 ms, each 3.9
ms, and we additionally derive their speed resulting in 26-dimensional feature
vectors (13+13) for each frame.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Raw instance-based classi cation scheme</title>
      <p>We consider a training set S of jSj audio records weakly annotated with one
or several labels among jCj classes (corresponding to one or several bird species
recognizable in the record). Each picture is described by a set of MFCC features
forming a reference dataset of N features xi. Our instance-based classi cation
scheme can be summarized as follows: MFCC features are extracted from the
query record IQ and searched independently in the reference set using an e cient
k-NN search scheme. A sliding vote procedure followed by a max pooling step is
then used to determine the best matching interval of each retrieved record and
rank them accordingly. A strong classi er is nally derived from the ranked list
of retrieved records via a simple top-K classi er (K=30 in all our experiments).
The details of these di erent steps are provided below:
4 http://sox.sourceforge.net/
5 http://marsyas.info/</p>
      <p>
        Hash-based k-NN search. The approximate k-Nearest Neighbors (k-NN)
of each MFCC feature xjQ belonging to a query record IQ are computed e
ciently thanks to the hash-based multi-probe search method introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Its principle is to train an adaptive search model at indexing time through kernel
density estimates computed on exact k-NN samples. This model is used at search
time to select the most probable buckets to be visited so as to retrieve on average
a fraction of the real K-NN's ( was set to 0:80 in all our experiments). The
advantage of this method compared to other state-of-the-art methods (such as
PQ-code [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] or Soft-Assignment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) is that it allows controlling accurately the
quality of the retrieved k-NN while being applicable to any quantization function
and metrics.
      </p>
      <p>
        In our case, the original scheme is transposed to the use of a more e ective hash
function and to the application of the Hamming Embedding principle [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. More
precisely, we used RMMH [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], a recent data-dependent hash function family, in
order to embed the original feature vectors in compact binary hash codes of 128
bits (the parameter M of RMMH was xed to M = 32). The m = log2(N ) rst
bits of the hash codes are used as keys of a hash map containing the N
128bits hash codes divided into the 2m buckets. Still at indexing time, the search
model is estimated based on the exact k-NN's in the 128-bits Hamming space
rather than in the original space. It has actually be shown in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] that RMMH
somehow works as an unsupervised metric learning technique and can provide
better matches in the embedded space than in the original one. At search time,
the query feature xjQ is also embedded via RMMH and the resulting 128-bits
hash code hjQ is passed to the probabilistic multi-probe algorithm (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] for
more details). Once the most probable buckets have been selected, the k nearest
neighbors are computed as the k nearest hash codes according to the Hamming
distance between the query hash code hjQ and the hash codes belonging to the
selected buckets. The value of k was trained by cross-validation and nally xed
to k = 200 in all our runs.
      </p>
      <p>Sliding vote and max pooling. A voting scheme is used to merge the
raw independent matches produced by the k-NN search of each MFCC feature
and return a ranked list of records. To increase the robusteness of the matching
and reduce the in uence of outliers and background features, the score of each
record is computed by rst voting within a sliding window (centered around
each frame) and then keeping the max score over the whole record. The size of
the sliding window was trained by cross-validation and then xed to s = 1000
frames in all our runs (resulting in a sliding window of 3.9 seconds).</p>
      <p>Top-K classi er. The nal instance-based classi er simply consists in
summing the scores of the records of each class Cj within the top-K retrieved records.
For a given test record IQ, the classi er then returns a ranked list of species
sorted by their score SQ(Cj ).</p>
    </sec>
    <sec id="sec-4">
      <title>Semantic pruning and weighting</title>
      <p>As we are in the context of weakly annotated audio records with multiple classes
(primary and secondary species) and highly cluttered contexts, we suggest
improving the raw matching scheme by semantically pruning the non-discriminant
training features (o ine) and weighting the query features according to the
semantic coherence of their k nearest neighbours (online).</p>
      <p>O ine training features pruning. To select the most discriminant
features of the training set, we rst compute the full k-nearest neighbors graph of
the training set by searching each individual MFCC feature xi of each training
audio record with the hash-based k-NN search technique described in section
3. We then a ect a discrimination score f (xi) to each feature computed as the
percentage of k-nearest neighbors having the same weak label than the feature
itself. We nally removed from the index all the features having a discrimination
score equal to zero. Note that we did experiment several other strategies such
as keeping all features in the index and using the discrimination scores f (xi) as
a weights within the vote of the search phase. This did not improve the results
over the brute-force pruning of non-discriminant features we nally used (we
experimented di erent regularization function and di erent values of k). This even
degraded the performances over the raw instance-based classi cation scheme due
to over- tting. Too much weights are actually concentrated on very few features
so that the contribution of the vast majority of useful features with moderate
scores remains negligeable in the vote. Our pruning strategy allows keeping the
contribution of these features by removing only the very bad ones with a null
discrimination score. It is important to note that the fraction of such bad
features is still large (about XX% of the 30M MFCC features). Removing them
consistently reduce the noise of the dataset as well as the size of the index and
the search complexity.</p>
      <p>Besides, we also explored alternative discrimination scoring strategy considering
the reverse k-nn's rather the direct k-nn's for the computation of f (xi) (in a
more Bayesian way). This lead to slightly worse results. Using the product of
both the direct and the reverse discrimination scores lead to similar conclusions.</p>
      <p>Online query features weighting. Similarly to the audio records of the
training set, test records might also contain a lot of unuseful features
corresponding to background noise or intervals of time where two much audible species
overlap. We therefore also compute a discrimination score (referred as fQ) for
each MFCC feature xjQ belonging to a query record IQ. A weak label ljQ is rst
estimated for each xjQ as the most represented label within the k-nearest
neighbors of xjQ (still computed by the hash-based k-nn search method described in
section 3). Similarly to f (), fQ() is then computed as the percentage of k-nearest
neighbors having the same weak label than the feature itself (i.e. the percentage
of k-nearest neighbors whose label is equal to ljQ).</p>
    </sec>
    <sec id="sec-5">
      <title>Metadata</title>
      <p>To further improve the identi cation performances, we explored the use of
metadata as a way to re-rank the returned species. The basic principle is to reduce the
score SQ(Cj ) of the species whose metadata values distribution in the training
set does not t the value of the metadata of the query record. More particularly,
we focused on three types of metadata that did bring some improvements in
our cross-validation tests: geo-location, altitude and time-of-day. Revised scores
SQ0(Cj ) are computed from the original ones according to:</p>
      <p>SQ0(Cj ) = SQ(Cj )
2
alt
e 2 a2lt</p>
      <p>g2eo
e 2 g2eo</p>
      <p>2
e 2 hh2oouurr
(1)
where the y's are equal to the di erence between the metadata value yq of the
query (e.g. the altitude of the query observation) and the closest metadata value
of all training records belonging to class Cj (e.g. the closest observation in terms
of altitude). The radial basis function allows preserving the score of the species
having at least one observation close to the query. On the other side, the score
of the species that do not have any hit close to the query tends to zero when
the distance to the closest observation grows. Note that we could have sum the
RBF weights on all training records rather than considering only the closest
one. This would have lead to a classical kernel density estimation and the weight
a ected to each species whould have been equivalent to its likelihood according
to the considered metadata. But it is important to note that the distribution
of the observations in the training set is biased in many ways. The density of
the observations is typically more correlated to the observer's displacements
than to the density of plants. Our model rather rely on the proved presence
of a species at a given place (or time of day) independently of the density of
the observations of that species at that place (or time of day). The values of the
sigmay's normalization factors were set proportionally to the standard deviation
of the distance to the nearest neighbor of any observation in the dataset.
6
6.1</p>
    </sec>
    <sec id="sec-6">
      <title>Experiments and results</title>
      <p>
        Dataset and task
The LifeCLEF 2014 bird dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is built from the Xeno-canto collaborative
database 6 involving at the time of writing more than 140k audio records covering
8700 bird species observed all around the world thanks to the active work of more
than 1400 contributors. The subset of Xeno-canto data used for the 2014 edition
of the task contains 14,027 audio recordings belonging to 501 bird species in
Brazil area, i.e. the ones having the more recordings in Xeno-canto database.
The dataset contains minimally 15 recordings per species (maximum 91) and
minimally 10 di erent recordists, maximally 42, per species.Audio records are
6 http://www.xeno-canto.org/
associated to various metadata such as the type of sound (call, song, alarm, ight,
etc.), the date and localization of the observations (from which rich statistics on
species distribution can be derived), some textual comments of the authors,
multilingual common names and collaborative quality ratings (more details can
be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). The task will be evaluated as a bird species retrieval task. A part
of the collection will be delivered as a training set available a couple of months
before the remaining data is delivered. The goal will be to retrieve the singing
species among the top-k returned for each of the undetermined observation of
the test set. Participants will be allowed to use any of the provided metadata
complementary to the audio content.
To tune the parameters of our instance-based retrieval scheme, we randomly
split the training set in 3=4 for training and 1=4 for testing. We then measured
the Mean Average Precision of our method at the audio records level, i.e. before
applying the nal top-K classi er at the species level. Some of the intermediate
results are displayed in Table 1 to illustrate the contribution of the di erent
steps of our method.
6.3
      </p>
      <p>Submitted runs and o</p>
      <p>cial results
We submitted two runs to the o cial competition:</p>
      <p>INRIA Zenith Run 1: Our instance-based classi cation scheme with
semantic pruning &amp; weighting but WITHOUT the metadata oriented reranking</p>
      <p>INRIA Zenith Run 2: Our full instance-based classi cation scheme with
both the semantic pruning &amp; weighting AND the metadata oriented reranking</p>
      <p>The o cial results are reported in Figure 1 and Table 2. We globally ranked
as the third team over the ten participants and as the second one if we consider
only the purely audio-based runs. Our best run achieved a mAP of 0; 365 when
considering only the primary species of each test recording (vs. 0:317 when
considering the background species as well). Compared to the mAP of our second
run equals to 0:328 (0:281 with the background species), it shows that the
contribution of our reranking scheme making use of the metatata is about 3:6 points
of mAP.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Indyk</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions</article-title>
          .
          <source>In: Foundations of Computer Science</source>
          ,
          <year>2006</year>
          . FOCS'
          <volume>06</volume>
          . 47th Annual IEEE Symposium on. pp.
          <volume>459</volume>
          {
          <fpage>468</fpage>
          .
          <string-name>
            <surname>Ieee</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Lifeclef bird identi cation task 2014</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>A posteriori multi-probe locality sensitive hashing</article-title>
          . In: ElSaddik,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Vuong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ,
            <surname>Griwodz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bimbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.D.</given-names>
            ,
            <surname>Candan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.S.</given-names>
            ,
            <surname>Jaimes</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . (eds.)
          <source>ACM International Conference on Multimedia (MM'08)</source>
          . pp.
          <volume>209</volume>
          {
          <fpage>218</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (10
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buisson</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Random maximum margin hashing</article-title>
          . In: CVPR. IEEE,
          <string-name>
            <surname>Colorado</surname>
            <given-names>springs</given-names>
          </string-name>
          ,
          <source>United States (Jun</source>
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Goeau, H.,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bakic</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barbe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Selmi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yahiaoui</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mouysset</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Molino</surname>
            ,
            <given-names>J.F.</given-names>
          </string-name>
          , et al.:
          <article-title>Interactive plant identi cation based on social image data</article-title>
          .
          <source>Ecological Informatics</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Joly</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Muller, H., Goeau, H.,
          <string-name>
            <surname>Glotin</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spampinato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rauber</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bonnet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vellinga</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          , Fisher,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Lifeclef 2014: multimedia life species identi cation challenges</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Jegou</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Douze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmid</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Product quantization for nearest neighbor search</article-title>
          .
          <source>Pattern Analysis and Machine Intelligence</source>
          ,
          <source>IEEE Transactions on 33(1)</source>
          ,
          <volume>117</volume>
          {
          <fpage>128</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Krapac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perronnin</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Furon</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jegou</surname>
          </string-name>
          , H.:
          <article-title>Instance classi cation with prototype selection</article-title>
          .
          <source>In: ICMR - ACM International Conference on Multimedia Retrieval</source>
          . Glasgow,
          <string-name>
            <surname>Royaume-Uni</surname>
          </string-name>
          (
          <year>Feb 2014</year>
          ), http://hal.inria.fr/hal-00942275
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Philbin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chum</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isard</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sivic</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Lost in quantization: Improving particular object retrieval in large scale image databases</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision</source>
          and Pattern
          <string-name>
            <surname>Recognition</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>