<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SeaCLEF 2016: Object proposal classification for fish detection in underwater videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonas Jäger</string-name>
          <email>Jonas.Jaeger@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Rodner</string-name>
          <email>Erik.Rodner@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Denzler</string-name>
          <email>Joachim.Denzler@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviane Wolff</string-name>
          <email>Viviane.Wolff@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Fricke-Neuderth</string-name>
          <email>Klaus.Fricke-Neuderth@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Group, Friedrich Schiller University Jena</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering and Information Technology, Fulda University of Applied Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This working note describes the results of the CVG Jena Fulda team for the fish recognition task in SeaCLEF 2016. Our method is based on convolutional neural networks applied to object proposals, both for detection and for species classification. We use background subtraction proposals that are filtered by a binary SVM classifier for fish detection and a multiclass SVM for species classification. Both SVMs utilize CNN features extracted from AlexNet. With this pipeline we achieve a recognition precision of 66% and a normalized counting score of 58% on the provided test dataset. We also show that classification of background subtraction proposals works much better for fish detection than background subtraction on its own.</p>
      </abstract>
      <kwd-group>
        <kwd>Object proposals</kwd>
        <kwd>R-CNN</kwd>
        <kwd>CNN features</kwd>
        <kwd>Fine-grained classification</kwd>
        <kwd>Fish detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents the participation of the CVG Jena Fulda team in SeaCLEF 2016 Task 1. The task deals with automatic recognition of coral reef fish species in low-resolution videos. All fish are shown in their natural, unrestricted habitat. See Fig. 1 for example frames.</p>
      <p>This task is important for advancing computer vision methods for biodiversity applications. Many scientists in the field of ecology collect large amounts of video data to monitor biodiversity in their specific applications. But manual analysis of this data is time consuming and requires the knowledge of rare human experts, which makes it impossible to evaluate data at a large scale. However, such large-scale analysis is essential to obtain the knowledge needed to save ecosystems that have a large impact on the human population. Therefore, tools for automatic video analysis need to be developed to support the work of ecologists.</p>
      <p>
        We have a special interest in this task because our team works on a closely related problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In our application we deal with high-resolution underwater video analysis of fish species in the Adriatic Sea in Croatia.
      </p>
      <p>We noticed that detection is a crucial part of a fish classification and counting system. But we also experienced that fish detection is a difficult problem due to lighting changes and the complex background of a natural environment. Therefore, we focus on robust fish detection in the remainder of this paper.</p>
      <p>
        Last year's participants [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] in this task used median-image background subtraction for fish detection. Boom et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] also utilized background subtraction methods and post-processed detection results with an objectness filter to remove bad detections. In contrast, we classify fish proposals by CNN features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this work we propose the use of object proposal classification for fish detection. Object proposals are obtained by background subtraction and then classified into fish and background by a binary support vector machine (SVM). For fish recognition we utilize a multiclass SVM trained for the 15 considered species. Both SVMs use the same CNN features, extracted from AlexNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], for prediction.
      </p>
      <p>
        Our detection approach is very similar to the idea of region-based convolutional neural networks (R-CNN) presented by Girshick et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In contrast to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we use the background subtraction method of Stauffer and Grimson [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] instead of selective search [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for proposal generation, since we can exploit temporal information in the video data. Another difference is that we do not apply domain-specific fine-tuning to the CNN.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Fish Dataset</title>
      <p>The provided dataset: The dataset contains videos and images of fish species in their natural coral reef environment. It is divided into a training set and a test set. Example frames from six different videos are shown in Fig. 1.</p>
      <p>The provided training set consists of 20 low-resolution videos and more than 20,000 sample images of 15 fish species. There are 5 videos with a resolution of 640 × 480 pixels and 15 videos with 320 × 240 pixels. All videos are annotated by two human experts with bounding boxes and species names.</p>
      <p>The test set contains 73 videos with a resolution of 320 × 240 pixels.
Dataset preparation: We split the given training videos into two parts of 10 videos each. One part is used as a validation set. The other 10 videos and all sample images are used for training and are called the training set in the rest of this paper.</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Overview</title>
        <p>Our main idea is to build a fish detector and to use its detections for species classification. Since the application of background subtraction methods on their own leads to a large number of false detections, we use background subtraction to get fish proposals and classify each proposal as fish or background. Then, all fish detections are classified as one of 15 species or rejected. Both classifiers, for detection and species recognition, use the same features.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Object proposal classification for fish detection</title>
        <p>Our fish detection approach consists of three steps: (1) generation of bounding box proposals, (2) extraction of CNN features for each proposal, and (3) classification of each bounding box proposal as fish or background. See Fig. 2 for an illustration of these steps.</p>
        <p>In step (1) we use the background subtraction algorithm of Stauffer and Grimson [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which uses a probabilistic background model that represents each pixel as a mixture of Gaussians. The result of this algorithm is a binary mask that indicates which pixels are background (see Fig. 2a). This mask is further used to obtain a second background mask (see Fig. 2b) by applying an erosion filter to it, which allows us to separate nearby fish.</p>
        <p>After that we apply the blob detection method of Suzuki and Abe [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to both masks to get bounding box proposals (see Fig. 2c). Bounding boxes with an area smaller than 100 pixels are removed, since such proposals are too small for species classification.</p>
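        <p>To make step (1) concrete, the following is a minimal Python/OpenCV sketch, not our exact implementation: OpenCV's MOG2 subtractor serves as a Gaussian-mixture stand-in for the Stauffer-Grimson model (its hyperparameters below are assumptions), cv2.findContours implements the border-following blob detection of Suzuki and Abe, and proposals are collected from both the raw and the eroded mask.</p>
        <preformat>
import cv2

# GMM-based background subtractor (stand-in for Stauffer-Grimson;
# parameters are illustrative, not the values used in our experiments)
bgs = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                         detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def fish_proposals(frame):
    fg = bgs.apply(frame)            # binary foreground mask (Fig. 2a)
    eroded = cv2.erode(fg, kernel)   # eroded mask separates nearby fish (Fig. 2b)
    boxes = []
    for mask in (fg, eroded):
        # Suzuki-Abe border following; OpenCV 4.x returns (contours, hierarchy)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w * h >= 100:         # drop proposals smaller than 100 px
                boxes.append((x, y, x + w, y + h))
    return boxes
        </preformat>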
        <p>In step (2) we use the generated proposals to extract CNN features from AlexNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which was pretrained on ILSVRC 2012 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. As features we choose the activations of the 7th hidden layer (relu7) of the convolutional network. Note that we did not fine-tune the convolutional net by training it with fish images.</p>
        <p>[Fig. 2: (a) background subtraction mask, (b) eroded mask, (c) object proposals, (d) boxes that were classified as fish]</p>
        <p>In step (3), based on these features, we utilize a binary SVM to classify each bounding box proposal as fish or background (see Fig. 2d). We then keep all fish detections whose confidence level is greater than or equal to 0.5. In order to obtain a probability measure from SVM scores we use Platt scaling [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as implemented in scikit-learn [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].</p>
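        <p>A compact sketch of steps (2) and (3) follows, using torchvision's ImageNet-pretrained AlexNet as a stand-in for the original model; function and variable names, input size, and normalization constants are assumptions. In scikit-learn, SVC with probability=True applies exactly the Platt scaling mentioned above for binary problems.</p>
        <preformat>
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# AlexNet pretrained on ILSVRC 2012; we read out the activations of the
# ReLU that follows fc7, i.e. the 7th hidden layer ("relu7").
alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
relu7_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:6])

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def relu7_features(crop_rgb):
    # 4096-dim relu7 activation for one proposal crop (RGB uint8 array)
    x = preprocess(crop_rgb).unsqueeze(0)
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(x)).flatten(1)
        return relu7_head(conv).squeeze(0).numpy()

# binary fish/background SVM; probability=True turns on Platt scaling
fish_svm = SVC(kernel="linear", probability=True)
# fish_svm.fit(train_features, train_labels)       # 1 = fish, 0 = background
# p_fish = fish_svm.predict_proba(features)[:, 1]  # keep boxes with p_fish >= 0.5
        </preformat>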
        <p>As a post-processing step we apply non-maximum suppression to remove duplicate boxes for all fish.</p>
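        <p>Non-maximum suppression can be realized as a simple greedy loop over the score-sorted detections; the overlap threshold in this sketch is an assumption, as we do not report the exact value here.</p>
        <preformat>
def iou(a, b):
    # intersection over union of two boxes (x1, y1, x2, y2)
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, overlap_thr=0.3):
    # keep the highest-scoring box of each group of overlapping boxes
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if not any(iou(boxes[i], boxes[j]) > overlap_thr for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
        </preformat>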
        <p>Detector training: To train our detector we extract CNN features (see step (2)) and fit the SVM classifier to the classes background and fish. As training data we utilize the fish sample images of the training set (see section 2) and extract all annotated fish from the 10 training videos. As background examples we generate object proposals from the training set videos and extract those boxes that have no intersection with a ground-truth fish box.</p>
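        <p>Selecting background examples then reduces to an overlap test against the annotations; the names in this fragment are illustrative.</p>
        <preformat>
def intersects(a, b):
    # True if two axis-aligned boxes (x1, y1, x2, y2) share any area
    return a[2] > b[0] and b[2] > a[0] and a[3] > b[1] and b[3] > a[1]

def background_examples(proposals, gt_boxes):
    # background examples: proposals that touch no ground-truth fish box
    return [p for p in proposals
            if not any(intersects(p, g) for g in gt_boxes)]
        </preformat>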
      </sec>
      <sec id="sec-3-3">
        <title>Species classification using CNN features</title>
        <p>
          As in our previous work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we use CNN features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and a multiclass SVM for species prediction. We utilize the same CNN features that were extracted in detection step (2) from AlexNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which was pretrained on ImageNet. As features we choose the activations of the 7th hidden layer (relu7) of the network.
        </p>
        <p>When the confidence level for a classification is lower than 0.5 we consider it an unknown fish and reject it. In order to get probabilities from SVM scores we use the method of Wu et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].</p>
        <p>The SVM is trained with a one-vs-rest strategy for the 15 considered fish species. The training data are composed of the provided species sample images and all annotated fish cropped from the training set videos (see section 2).</p>
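        <p>As a sketch, the species classifier can be expressed in scikit-learn on top of the relu7_features helper from above. Note that SVC with probability=True derives its multiclass probabilities with the pairwise-coupling method of Wu et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]; a strictly one-vs-rest setup would instead calibrate one binary SVM per species. The kernel choice and names are assumptions.</p>
        <preformat>
import numpy as np
from sklearn.svm import SVC

# multiclass species SVM; probability=True makes scikit-learn compute
# class probabilities via the pairwise coupling of Wu et al.
species_svm = SVC(kernel="linear", probability=True)
# species_svm.fit(train_features, species_labels)   # 15 species

def classify_detection(features):
    proba = species_svm.predict_proba([features])[0]
    best = int(np.argmax(proba))
    if proba[best] >= 0.5:
        return species_svm.classes_[best]
    return None    # confidence below 0.5: rejected as unknown fish
        </preformat>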
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Fish detection results</title>
        <p>
          One of our main interests is how well object proposal classification (OPC) works for fish detection compared to background subtraction on its own. For that purpose we first describe the methods listed in Tab. 1 and then define our evaluation process in the style of Pascal VOC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Finally, we discuss our fish detection results presented in Fig. 3 and Tab. 1.
Methods: The first method, called BgsMedian in Tab. 1, computes a median background image from all frames of a video and subtracts the current frame from that background image. A specific pixel of the median image is calculated as the median value of all pixels at the same position in the video. This method was also used by last year's participants [
          <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
          ].
        </p>
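        <p>The BgsMedian baseline amounts to a few lines of NumPy; the difference threshold in this sketch is an assumption, as we do not report the exact value.</p>
        <preformat>
import numpy as np

def median_background(frames):
    # per-pixel median over all frames of a video (N x H x W x 3, uint8)
    return np.median(np.stack(frames), axis=0).astype(np.uint8)

def foreground_mask(frame, background, thresh=30):
    # binary mask of pixels deviating strongly from the median background
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16)).max(axis=2)
    return np.where(diff >= thresh, 255, 0).astype(np.uint8)
        </preformat>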
        <p>The second method, referenced as BgsGMM, was developed by Stauffer and Grimson [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and uses a probabilistic background model that represents each pixel as a mixture of Gaussians.</p>
        <p>To obtain bounding boxes from these background subtraction methods we applied the blob detection method proposed by Suzuki and Abe [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].</p>
        <p>OPC (BgsMedian) and OPC (BgsGMM) use the pipeline described in section 3.2, with the exception that BgsMedian is used for bounding box proposal generation in OPC (BgsMedian).</p>
        <p>In our experiments we fine-tuned the parameters of the background subtraction methods for fish detection when used on their own. When we used these methods for proposal generation, the parameters were adjusted to obtain many fish proposals.</p>
        <p>
          Evaluation process: As in Pascal VOC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we consider a fish detection as correct (true positive) if the intersection over union ratio (IoU) of a ground-truth box with a predicted box is greater than or equal to 0.5. If more than one predicted box satisfies this condition for a specific ground-truth box, then one predicted box is counted as a true positive and the remaining boxes are counted as false positives.
        </p>
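        <p>The matching rule can be sketched as a greedy assignment over the detections, reusing the iou helper from section 3.2; processing detections in descending confidence order is an assumption in line with the Pascal VOC protocol.</p>
        <preformat>
def match_detections(gt_boxes, detections, iou_thr=0.5):
    # detections are assumed sorted by descending confidence
    matched, tp, fp = set(), 0, 0
    for det in detections:
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gt_boxes):
            if i not in matched and iou(det, gt) >= best_iou:
                best, best_iou = i, iou(det, gt)
        if best is None:
            fp += 1        # unmatched or duplicate detection
        else:
            tp += 1        # first sufficient match per ground-truth box
            matched.add(best)
    return tp, fp
        </preformat>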
        <p>Discussion: Fig. 3 and Tab. 1 present detection results for the above-mentioned methods. The OPC (BgsGMM) approach works best in our setup. For detection by background subtraction alone, BgsGMM has a higher average precision score than BgsMedian, yet the average precision of BgsGMM is still 36.35% lower than that of OPC (BgsGMM).</p>
        <p>In general it can be observed that the OPC detection approaches work better than background subtraction in our setup, although the CNN was not fine-tuned to fish images.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Species classification</title>
        <p>For species classification we used the detections of OPC (BgsGMM) and extracted CNN features to classify each detection as one of the 15 considered fish species. If the confidence level for a classification was lower than 0.5, it was rejected.</p>
        <p>With this pipeline we achieve a counting score (CS) of 83%, a precision of 66%, and a normalized counting score (NCS) of 58% (see Fig. 4). CS and NCS are used as scoring functions in SeaCLEF 2016 and are defined as CS = e<sup>-d/N<sub>gt</sub></sup> (1), with d as the difference between the number of ground-truth occurrences N<sub>gt</sub> and the predicted occurrences per species, and NCS = CS × precision (2).</p>
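        <p>For concreteness, the two scores of Eqs. (1) and (2) in Python; taking the absolute value of the count difference d is an assumption.</p>
        <preformat>
import math

def counting_score(n_gt, n_pred):
    # Eq. (1): CS = exp(-d / N_gt), d = per-species count difference
    d = abs(n_gt - n_pred)
    return math.exp(-d / n_gt)

def normalized_counting_score(cs, precision):
    # Eq. (2): NCS = CS * precision
    return cs * precision
        </preformat>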
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper described our participation in the SeaCLEF 2016 fish species recognition task. We focused on robust fish detection, since the simple application of background subtraction methods leads to a large number of false detections. We therefore compared traditional background subtraction methods, mainly used for fish detection so far, with object proposal classification (OPC) for detection. We show that OPC fish detection (Fig. 2) works much better than background subtraction (Fig. 3) in our setup.</p>
      <p>For species recognition we use the same CNN features as for detection and classify each fish with a multiclass SVM. Using this pipeline we achieve a normalized counting score of 58% and a precision of 66% (see Fig. 4) on the provided test dataset.</p>
      <p>For the future we plan to incorporate fish tracking. We also want to use larger CNN models and fine-tune these models to fish data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bastiaan</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boom</surname>
          </string-name>
          , Jiyin He, Simone Palazzo, Phoenix X. Huang, Cigdem Beyan,
          <string-name>
            <surname>Hsiu-Mei</surname>
            <given-names>Chou</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang-Pang</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Concetto
          <string-name>
            <surname>Spampinato</surname>
          </string-name>
          , and
          <string-name>
            <surname>Robert</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Fisher</surname>
          </string-name>
          .
          <article-title>A research tool for long-term and continuous analysis of sh assemblage in coral-reefs using underwater camera footage</article-title>
          .
          <source>Ecological Informatics</source>
          ,
          <volume>23</volume>
          :
          <fpage>83</fpage>
          {
          <fpage>97</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Cabrera-Gmez</surname>
          </string-name>
          , Modesto Castrilln Santana, Antonio Domnguez-Brito, Daniel Hernandez-Sosa,
          <article-title>Josep Isern-Gonzlez, and Javier Lorenzo-Navarro. Exploring the use of local descriptors for sh recognition in lifeclef 2015</article-title>
          .
          <source>In Working Notes of the 6th International Conference of the CLEF Initiative. CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          . Vol-
          <volume>1391</volume>
          , urn:nbn:de:
          <fpage>0074</fpage>
          -
          <lpage>1391</lpage>
          -8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Fish identi cation in underwater video with deep convolutional neural network: Snumedinfo at lifeclef sh task 2015</article-title>
          .
          <source>In Working Notes of the 6th International Conference of the CLEF Initiative. CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          . Vol-
          <volume>1391</volume>
          , urn:nbn:de:
          <fpage>0074</fpage>
          -
          <lpage>1391</lpage>
          -8.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Je</given-names>
            <surname>Donahue</surname>
          </string-name>
          , Yangqing Jia, Oriol Vinyals, Judy Ho man, Ning Zhang, Eric Tzeng, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Decaf: A deep convolutional activation feature for generic visual recognition</article-title>
          .
          <source>CoRR, abs/1310.1531</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mark</given-names>
            <surname>Everingham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Ali Eslami</surname>
          </string-name>
          , Luc Van Gool,
          <string-name>
            <surname>Christopher</surname>
            <given-names>K. I. Williams</given-names>
          </string-name>
          , John Winn, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>The pascal visual object classes challenge: A retrospective</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>111</volume>
          (
          <issue>1</issue>
          ):
          <volume>98</volume>
          {
          <fpage>136</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ross</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            , Je Donahue, Trevor Darrell, and
            <given-names>Jitendra</given-names>
          </string-name>
          <string-name>
            <surname>Malik</surname>
          </string-name>
          .
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>CoRR, abs/1311.2524</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Jonas Jager, Marcel Simon, Joachim Denzler, Viviane Wol , Klaus FrickeNeuderth, and Claudia Kruschel.
          <article-title>Croatian sh dataset: Fine-grained classi cation of sh species in their natural habitat</article-title>
          . In T.
          <string-name>
            <surname>Pltz S. McKenna T. Amaral</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Matthews</surname>
          </string-name>
          and R. Fisher, editors,
          <source>Proceedings of the Machine Vision of Animals and their Behaviour (MVAB)</source>
          , pages
          <fpage>6</fpage>
          .1{
          <issue>6</issue>
          .7. BMVA Press,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>