<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Feature Learning via Mixtures of DCNNs for Fine-Grained Plant Classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chris McCool</string-name>
          <email>c.mccool@qut.edu.au</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>ZongYuan Ge</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Peter Corke</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We present the plant classification system submitted by the QUT RV team to the LifeCLEF 2016 plant task. Our system learns two deep convolutional neural network models. The first is a domain-specific model and the second is a mixture of content-specific models, one for each of the plant organs such as branch, leaf, fruit, flower and stem. We combine these two models, and experiments on the PlantCLEF2016 dataset show that this approach improves on the baseline system, with the mean average precision increasing from 0.603 to 0.629 on the test set.</p>
      </abstract>
      <kwd-group>
        <kwd>deep convolutional neural network</kwd>
        <kwd>plant classification</kwd>
        <kwd>mixture of deep convolutional neural networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Fine-grained image classification has received considerable attention recently,
with a particular emphasis on classifying various species of birds, dogs and
plants [
        <xref ref-type="bibr" rid="ref1 ref2 ref4">1, 2, 4, 8</xref>
        ]. It is a challenging computer
vision problem because of the small inter-class variation and large intra-class variation.
Plant classification is a particularly important domain because of the
implications for automating agriculture as well as enabling robotic agents to detect and
measure plant distribution and growth.
      </p>
      <p>
        To evaluate the current performance of state-of-the-art vision
technology for plant recognition, the Plant Identification Task of the LifeCLEF
challenge [
        <xref ref-type="bibr" rid="ref5">5, 7</xref>
        ] focuses on distinguishing 1000 herb, tree and fern species. This is still
an observation-centered task, where several images of a plant, covering up to seven organs,
are related to one observation. The seven organs, referred to as content
types, are the entire plant, branch, leaf, fruit, flower, stem and
leaf scan. In addition to the 1000 known classes, the 2016 PlantCLEF
evaluation includes classes external to this set, making this a more open-set recognition
problem.
      </p>
      <p>
        Inspired by [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we use a deep convolutional neural network (DCNN) approach
and learn a separate DCNN for each content type. The DCNNs for the individual content
types are combined using a mixture of DCNNs (MixDCNN). Combining this approach with a
standard fine-tuned DCNN improves the mean average precision (mAP) from
0.601 to 0.629 on the test set.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Our Approach</title>
      <p>We propose a system that uses content types during the training phase but does
not use this information at test time. This provides a more practical real-world
system that does not require well-labelled images from the user. In PlantCLEF
2016 there are 7 organ types, ranging from branch through to fruit and stem;
example images are given in Figure 1.</p>
      <p>Our proposed system consists of two key parts. First, we learn a
domain-generic DCNN, termed GCNN, which classifies the plant image regardless of
content type. Second, we learn a MixDCNN, termed MDCNN, which first learns a
content-specific DCNN for each of the 6 organ types<sup>1</sup>. We combine the outputs of
these two systems to form the final classification decision. For all of our systems,
the base network is the GoogLeNet model of Szegedy et al. [9].</p>
      <p>We learn a domain-generic DCNN, GCNN, that ignores the content type of
the plant image. This model uses only the class label information to train a
very deep neural network consisting of 22 layers, the GoogLeNet model [9]. To
apply this model to plant data we use transfer learning to fine-tune the
parameters of this general object classification model to the problem at hand,
plant classification.</p>
      <sec id="sec-2-1">
        <title>1 The leaf and leaf scan organ types were combined into one.</title>
        <p>Transfer learning has been used for a variety of tasks, with one of its earliest
uses for fine-grained classification being to learn a bird classification model [10].
We use transfer learning to fine-tune the parameters of the GoogLeNet model
by training it for approximately 18 epochs.</p>
        <sec id="sec-2-1-1">
          <title>MixDCNN</title>
          <p>We learn a MixDCNN, MDCNN, which consists of K DCNNs. This allows
each of the K DCNNs to learn features appropriate for the samples that have
been assigned to it, which in turn yields more
discriminative features. We do this by calculating the probability that the k-th
component (DCNN), <inline-formula><tex-math>S_k</tex-math></inline-formula>, is responsible for the t-th sample <inline-formula><tex-math>x_t</tex-math></inline-formula>. Such an approach
also gives a system that does not require the content type of the
sample to be labelled at test time.</p>
          <p>For PlantCLEF 2016 there are 7 pre-defined content types consisting of
images of the entire plant, branch, leaf, fruit, flower, stem or a leaf scan. For the
MixDCNN, we make use of the content type to learn a DCNN that is fine-tuned
(specialised) for a subset of the content types. However, because of the similarity
between the leaf and leaf scan content types, we combine them into one. As such,
the MixDCNN has K = 6 components, one per content type. To train the k-th component
(DCNN) we use the <inline-formula><tex-math>N_k</tex-math></inline-formula> images assigned to this subset, <inline-formula><tex-math>X_k = [x_1, \ldots, x_{N_k}]</tex-math></inline-formula>, with
their corresponding class labels. We then fine-tune the GoogLeNet model,
as in Section 2.1, to learn a content-specific model. Once each content-specific
DCNN has been trained, we perform joint training using the MixDCNN.</p>
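          <p>The assignment of training images to the K = 6 subsets can be sketched as follows. This is only an illustrative sketch: the content-type strings, the sample tuples and the helper name <monospace>split_by_content</monospace> are assumptions made for the example and are not taken from the dataset or from our implementation.</p>
          <preformatted>
```python
from collections import defaultdict

# Assumed content-type labels; the strings in the dataset metadata may differ.
# Leaf scans are mapped onto the leaf subset so that both share one component.
MERGED = {"LeafScan": "Leaf"}


def split_by_content(samples):
    """Group (image_id, class_label, content_type) triples into subsets X_k."""
    subsets = defaultdict(list)
    for image_id, class_label, content in samples:
        key = MERGED.get(content, content)
        subsets[key].append((image_id, class_label))
    return dict(subsets)


# Hypothetical toy samples covering all 7 content types.
samples = [
    ("img001", 17, "Leaf"),
    ("img002", 17, "LeafScan"),
    ("img003", 42, "Flower"),
    ("img004", 42, "Entire"),
    ("img005", 5, "Branch"),
    ("img006", 5, "Fruit"),
    ("img007", 9, "Stem"),
]
subsets = split_by_content(samples)
print(len(subsets))          # number of components K
print(len(subsets["Leaf"]))  # leaf and leaf scan images together
```
          </preformatted>
          <p>Each resulting subset, with its class labels, would then be used to fine-tune one component.</p>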
          <p>The K trained content-specific models are then combined in a MixDCNN
structure, shown in Figure 2. An important aspect of the MixDCNN model is to
calculate the probability that the k-th component is responsible for the sample.
This occupation probability is calculated as</p>
          <disp-formula><label>(1)</label><tex-math>\alpha_k = \frac{\exp\{C_k\}}{\sum_{c=1}^{K} \exp\{C_c\}}</tex-math></disp-formula>
          <p>where <inline-formula><tex-math>C_k</tex-math></inline-formula>, written <inline-formula><tex-math>C_{k,t}</tex-math></inline-formula> when the sample index t is made explicit, is the best classification result for <inline-formula><tex-math>S_k</tex-math></inline-formula> using the t-th sample:</p>
          <disp-formula><label>(2)</label><tex-math>C_{k,t} = \max_{n=1,\ldots,N} z_{k,n,t}</tex-math></disp-formula>
          <p>where there are N = 1000 classes and <inline-formula><tex-math>z_{k,n,t}</tex-math></inline-formula> is the classification score from the k-th
component for the t-th sample and n-th class. This occupation probability gives
higher weight to components that are confident about their prediction.</p>
          <p>The final classification score is then given by multiplying the output of the
final layer from each component by the occupation probability and then summing
over the K components:</p>
          <disp-formula><label>(3)</label><tex-math>z_n = \sum_{k=1}^{K} z_{k,n} \alpha_k</tex-math></disp-formula>
          <p>
            This mixes the network outputs together. More details on this method can be
found in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
          </p>
          <p>
            In this section we present a comparative performance evaluation of our four runs.
We first present the results on the training set and then the results on
the test set, followed by a brief discussion. We use Caffe [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] to learn all of our
models, both domain-specific and MixDCNN.
          </p>
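          <p>To make the MixDCNN combination of Section 2.2 concrete, the following minimal sketch computes the occupation probabilities and the mixed class scores for a single sample. It is an illustration under stated assumptions rather than our Caffe implementation: the function names and the toy scores are hypothetical, and the maximum is subtracted before exponentiation purely for numerical stability, which leaves the resulting probabilities unchanged.</p>
          <preformatted>
```python
import math


def occupation_probabilities(scores):
    """scores[k][n] holds z_{k,n,t} for one sample t; returns alpha_k."""
    best = [max(component) for component in scores]  # C_{k,t}, Eq. (2)
    shift = max(best)  # stability only; cancels in the ratio of Eq. (1)
    weights = [math.exp(c - shift) for c in best]
    total = sum(weights)
    return [w / total for w in weights]  # alpha_k, Eq. (1)


def mix_scores(scores):
    """Final score z_n = sum over k of z_{k,n} * alpha_k, Eq. (3)."""
    alpha = occupation_probabilities(scores)
    n_classes = len(scores[0])
    return [
        sum(alpha[k] * scores[k][n] for k in range(len(scores)))
        for n in range(n_classes)
    ]


# Toy example with K = 2 components and N = 3 classes
# (the system described here uses K = 6 and N = 1000).
scores = [
    [4.0, 1.0, 0.0],  # a confident component
    [1.0, 1.5, 1.0],  # an unsure component
]
alpha = occupation_probabilities(scores)
z = mix_scores(scores)
predicted = max(range(len(z)), key=z.__getitem__)
print(alpha, z, predicted)
```
          </preformatted>
          <p>In this toy case the confident component receives most of the weight, so the mixed prediction follows it.</p>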
          <p>At test time our model does not use any content information; rather, it
automatically classifies the image with minimal user information. This means we
use all of the 113,205 images of 1,000 classes to train our model. Results on the
training set are given in Table 1, which shows the results of the MixDCNN
model after training for 2 epochs and 17 epochs. The submitted system was
trained for only 2 epochs<sup>2</sup> due to resource and time constraints.</p>
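          <p>Because training length is often configured as a number of mini-batch iterations rather than epochs, the conversion can be sketched as below. The batch size of 32 is a hypothetical value chosen for illustration; the batch size actually used is not stated here.</p>
          <preformatted>
```python
import math

NUM_IMAGES = 113205  # training images, as stated in the text
BATCH_SIZE = 32      # hypothetical; not specified in this paper


def iterations_for(epochs, num_images=NUM_IMAGES, batch_size=BATCH_SIZE):
    """Mini-batch updates needed to pass over the training data `epochs` times."""
    per_epoch = math.ceil(num_images / batch_size)
    return epochs * per_epoch


for epochs in (2, 17):
    print(epochs, iterations_for(epochs))
```
          </preformatted>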
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2 Further fine-tuning was performed after submission.</title>
        <sec id="sec-2-2-1">
          <title>Results on Test Set</title>
          <p>In this section, we present our submitted results for the PlantCLEF2016
challenge. We submitted four runs:</p>
          <list list-type="bullet">
            <list-item><p>QUT Run 1 is the baseline result of a GoogLeNet fine-tuned using all
of the organ types, with the rank 1 score submitted for each observation.</p></list-item>
            <list-item><p>QUT Run 2 is the MixDCNN system, with the rank 1 score submitted for
each observation.</p></list-item>
            <list-item><p>QUT Run 3 is the combination of the baseline and MixDCNN systems, with the
rank 1 score submitted for each observation.</p></list-item>
            <list-item><p>QUT Run 4 is the combination of the baseline and MixDCNN systems with a
threshold to remove potential false positives.</p></list-item>
          </list>
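          <p>The difference between submitting the rank 1 result directly and applying a rejection threshold, as in Run 4, can be sketched as follows. This is a sketch under assumptions: applying the threshold to a softmax probability, the threshold values and the function names are all hypothetical, since the exact rule and value are not given here.</p>
          <preformatted>
```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank1_with_threshold(scores, threshold=None):
    """Return (class_index, probability), or None if the top result is rejected."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if threshold is not None and threshold > probs[best]:
        return None  # treated as a potential false positive
    return best, probs[best]


scores = [2.0, 1.0, 0.5]                       # toy scores for 3 classes
accepted = rank1_with_threshold(scores)        # no threshold, as in Runs 1 to 3
kept = rank1_with_threshold(scores, 0.5)       # lenient hypothetical threshold
rejected = rank1_with_threshold(scores, 0.9)   # strict hypothetical threshold
print(accepted, kept, rejected)
```
          </preformatted>
          <p>A strict threshold discards more predictions, which is why its value has to be chosen with care.</p>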
          <p>In Figure 3 we present the overall performance of all of the competitors
using the defined score metric. Our best performing system
is Run 3, which achieved a score of 0.629. This system, Fusion, combines
the domain-specific model, GCNN, with the MixDCNN model,
MDCNN, using equal-weight fusion of the classification layers. A summary of
these systems is presented in Table 2.</p>
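          <p>Equal-weight fusion of two classification outputs can be sketched as a class-wise average. The function name and the toy score vectors below are hypothetical and serve only to illustrate the idea; they are not outputs of the submitted systems.</p>
          <preformatted>
```python
def fuse_equal_weight(generic_scores, mixture_scores):
    """Average two score vectors class by class (weights of 0.5 each)."""
    if len(generic_scores) != len(mixture_scores):
        raise ValueError("score vectors must cover the same classes")
    return [0.5 * (g + m) for g, m in zip(generic_scores, mixture_scores)]


def top_class(scores):
    """Index of the highest-scoring class."""
    return max(range(len(scores)), key=scores.__getitem__)


generic = [2.0, 3.0, 0.5]  # toy output of the single fine-tuned network
mixture = [4.0, 1.0, 0.5]  # toy output of the MixDCNN
fused = fuse_equal_weight(generic, mixture)
print(top_class(generic), top_class(mixture), top_class(fused))
```
          </preformatted>
          <p>In this toy case the two models disagree, and the fused scores favour the class on which the more confident model is strongest.</p>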
          <p>Run 4 is the same as Run 3 with a preset threshold to remove potential
false positives. The precision of this system is considerably lower than that of any of the
other systems, which shows that this threshold must be chosen judiciously.</p>
          <p>In this paper we presented a domain-specific model and a MixDCNN model to perform
automatic classification of plant images. The domain-specific model is learnt by
fine-tuning a well-known model specifically for the plant classification task. The
MixDCNN model is learnt by first fine-tuning a model to each of K subsets of data,
in this case defined by the different organ types. We then jointly optimise these K
DCNN models using the mixture of DCNNs framework. Combining these
two approaches yields improved performance and demonstrates the importance
of learning complementary models to perform accurate classification, with the
performance improving from 0.603 to 0.629. We note that the MixDCNN model
was trained for only 2 epochs; we expect improved performance from a model
that has been trained for longer. Finally, this system is fully automatic as it
does not require the organ (content) type to be specified at test time.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>The Australian Centre for Robotic Vision is supported by the Australian Research
Council via the Centre of Excellence program.</p>
      <p>7. Alexis Joly, Herve Goeau, Herve Glotin, Concetto Spampinato, Pierre Bonnet,
Willem-Pier Vellinga, Julien Champ, Robert Planque, Simone Palazzo, and Henning Muller.
LifeCLEF 2016: multimedia life species identification challenges. In Proceedings of CLEF 2016, 2016.</p>
      <p>8. Asma Rejeb Sfar, Nozha Boujemaa, and Donald Geman. Confidence sets for
fine-grained categorization and plant species identification. IJCV, 2014.</p>
      <p>9. Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going
deeper with convolutions. arXiv:1409.4842, 2014.</p>
      <p>10. Ning Zhang, Jeff Donahue, Ross Girshick, and Trevor Darrell. Part-based R-CNNs
for fine-grained category detection. In ECCV, pages 834–849. 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Symbiotic segmentation and part localization for fine-grained categorization</article-title>
          .
          <source>In ICCV</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Efstratios</given-names>
            <surname>Gavves</surname>
          </string-name>
          , Basura Fernando, Cees GM Snoek, Arnold WM Smeulders, and Tinne Tuytelaars.
          <article-title>Local alignments for fine-grained categorization</article-title>
          .
          <source>International Journal of Computer Vision</source>
          , pages
          <fpage>1</fpage>–<lpage>22</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>ZongYuan</given-names>
            <surname>Ge</surname>
          </string-name>
          , Alex Bewley,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>McCool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ben</given-names>
            <surname>Upcroft</surname>
          </string-name>
          , Conrad Sanderson, and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Corke</surname>
          </string-name>
          .
          <article-title>Fine-grained classification via mixture of deep convolutional neural networks</article-title>
          .
          <source>WACV</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>ZongYuan</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Christopher</given-names>
            <surname>McCool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Conrad</given-names>
            <surname>Sanderson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Corke</surname>
          </string-name>
          .
          <article-title>Subset feature learning for fine-grained classification</article-title>
          .
          <source>CVPR Workshop on Deep Vision</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Herve Goeau, Pierre Bonnet, and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          .
          <article-title>Plant identification in an open-world (LifeCLEF 2016)</article-title>
          .
          <source>In CLEF Working Notes 2016</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Yangqing</given-names>
            <surname>Jia</surname>
          </string-name>
          , Evan Shelhamer, Jeff Donahue, Sergey Karayev,
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Long</surname>
          </string-name>
          , Ross Girshick, Sergio Guadarrama, and Trevor Darrell.
          <article-title>Caffe: Convolutional architecture for fast feature embedding</article-title>
          .
          <source>arXiv:1408.5093</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>