<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fish identification in underwater video with deep convolutional neural network: SNUMedinfo at LifeCLEF fish task 2015</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sungbin Choi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Biomedical Engineering, Seoul National University</institution>
          ,
          <country>Republic of Korea</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes our participation in the LifeCLEF Fish task 2015, which concerns video-based fish identification. First, we applied a foreground detection method with selective search to extract candidate fish object windows. A deep convolutional neural network was then used to classify the fish species in each window, and the classification results were post-processed to produce the final identification output. Experimental results showed effective performance despite the challenging task conditions, and our approach achieved the best performance in this task.</p>
      </abstract>
      <kwd-group>
        <kwd>Object detection</kwd>
        <kwd>Image classification</kwd>
        <kwd>Deep convolutional neural network</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we describe the participation of the SNUMedinfo team in the LifeCLEF Fish task 2015. The purpose of the task is to automatically count the fish of each species in video segments. The training data includes annotated video clips and sample images of 15 fish species. For a detailed introduction to the task, please see the overview paper of this task (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ).
      </p>
      <p>
        In recent years, deep Convolutional Neural Networks (CNNs) have dramatically improved automatic image classification performance (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ). In this study, we experimented with GoogLeNet (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ), which has shown effective performance in a recent ImageNet challenge (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ).
      </p>
      <p>First, we applied a foreground detection method with selective search to extract candidate fish object windows (Section 2.1). A CNN was trained and used to identify the fish species in each candidate window (Section 2.2). The CNN classification results were then further refined to produce the final identification output (Section 2.3). Our experimental methods are detailed in the next section.</p>
      <sec id="sec-1-1">
        <title>Candidate fish object window extraction</title>
      </sec>
      <sec id="sec-1-2">
        <title>Foreground detection</title>
        <p>
          First, we identified the background region of each video clip. If a video clip has S temporal segments, each pixel location has S corresponding pixel values. For each pixel location in the video clip, we took the median value as the background pixel value (Fig 2). Pixels whose values differ from this background by more than a predefined threshold are considered foreground pixels. A bilateral filter is applied to smooth the foreground image (Fig 3). Then, we applied selective search (
          <xref ref-type="bibr" rid="ref5">5</xref>
          ) to extract candidate fish object windows.
        </p>
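        <p>The median-background procedure above can be sketched as follows (a minimal NumPy sketch; the threshold value is a hypothetical placeholder, and in practice an edge-preserving bilateral filter such as OpenCV's cv2.bilateralFilter would then smooth each foreground mask before selective search):</p>

```python
import numpy as np

def foreground_masks(frames, thresh=30):
    """Per-pixel median background model for one video clip.

    frames: list of S grayscale frames (H x W uint8 arrays).
    Returns one binary foreground mask per frame: pixels whose value
    differs from the median background by more than `thresh`.
    """
    stack = np.stack(frames).astype(np.int16)   # S x H x W
    background = np.median(stack, axis=0)       # per-pixel median over time
    masks = []
    for frame in stack:
        # Foreground = large absolute deviation from the background model
        masks.append(np.abs(frame - background) > thresh)
    # A bilateral filter (e.g. cv2.bilateralFilter) would be applied here
    # to smooth each foreground image before extracting candidate windows.
    return masks
```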
      </sec>
      <sec id="sec-1-3">
        <title>Fish species identification</title>
      </sec>
      <sec id="sec-1-4">
        <title>Preparing training set for CNN</title>
        <p>In the fish task training set, there are 20 video clips with bounding box annotations, and sample images of the 15 considered species. We formulated the training set and validation set as follows.</p>
        <p>Training set: sample images of the 15 fish species + 10 video clips</p>
        <p>Validation set: the other 10 video clips</p>
        <p>For each video clip, among the candidate fish object windows extracted in Section 2.1, windows having an intersection over union (IoU) above 0.7 with a ground truth bounding box annotation are considered positive examples of the target fish species. Candidate fish object windows having an IoU below 0.2 are considered negative examples (no fish inside the window). So we have 16 labels for image classification (15 fish species + ‘no fish’).</p>
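        <p>The IoU-based labeling above can be sketched as follows (a minimal sketch; the (x1, y1, x2, y2) box format and the species name are illustrative assumptions):</p>

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def label_window(window, annotations):
    """Label a candidate window against the ground truth annotations.

    annotations: list of (box, species) pairs for the frame.
    IoU over 0.7 -> positive example of that species;
    IoU under 0.2 with every box -> 'no fish' negative example;
    anything in between -> None (excluded from training).
    """
    best_iou, best_species = 0.0, None
    for box, species in annotations:
        overlap = iou(window, box)
        if overlap > best_iou:
            best_iou, best_species = overlap, species
    if best_iou > 0.7:
        return best_species
    if best_iou < 0.2:
        return 'no fish'
    return None
```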
      </sec>
      <sec id="sec-1-5">
        <title>Training CNN</title>
        <p>We utilized GoogLeNet for image classification. GoogLeNet incorporates Inception modules with the intention of increasing network depth while maintaining computational efficiency. Training the CNN for fish identification started from GoogLeNet pretrained on the ImageNet dataset. We fine-tuned the CNN on the fish identification training set (initial learning rate 0.001; batch size 40).</p>
      </sec>
      <sec id="sec-1-6">
        <title>Post-processing classification results</title>
      </sec>
      <sec id="sec-1-7">
        <title>Filtering CNN output within each video segment</title>
        <p>The CNN classification results from Section 2.2 contain many image windows overlapping each other, so we need to select the best matching windows for the final output. First, among all positive windows, we selected at most 20 windows with the top classification scores from the CNN. Second, windows having an IoU above 0.3 with a higher-scoring window are considered duplicates and removed.</p>
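        <p>This filtering step is essentially a greedy non-maximum suppression; a minimal sketch follows (the (box, score, species) window format is an assumption):</p>

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / float(union)

def filter_windows(windows, max_keep=20, dup_iou=0.3):
    """windows: list of (box, score, species) tuples for one segment.

    Keep at most `max_keep` top-scoring windows, then drop any window
    overlapping a higher-scoring kept window by IoU above `dup_iou`.
    """
    ranked = sorted(windows, key=lambda w: w[1], reverse=True)[:max_keep]
    kept = []
    for box, score, species in ranked:
        # A window is a duplicate if it overlaps any already-kept window
        if all(iou(box, k[0]) <= dup_iou for k in kept):
            kept.append((box, score, species))
    return kept
```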
      </sec>
      <sec id="sec-1-8">
        <title>Refining classification output by utilizing temporally connected video segment</title>
        <p>Video segments are temporally connected, so a fish object present in the previous frame is expected to be located in a nearby region in the next frame. Based on this idea, we applied the following two rules.</p>
        <p>Rule 1 (adding): if video segments (k-1) and (k+1) have a positive fish object window in a nearby location, but video segment (k) does not have a fish object window in that location, then the fish is expected to be present in segment (k) as well.</p>
        <p>Rule 2 (removing): if video segment (k) has a positive fish object window, but neither video segment (k-1) nor (k+1) has a fish object window in a nearby location, then the fish is expected to be absent from segment (k).</p>
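        <p>The two rules can be sketched as follows, representing each detection by its window centre (the "nearby" pixel radius is a hypothetical parameter):</p>

```python
def refine_temporally(detections, near=50):
    """Apply the two temporal rules to per-segment detections.

    detections: list over segments; detections[k] is a list of (x, y)
    detection centres in segment k. Returns the refined lists.
    """
    def has_near(dets, pt):
        return any(abs(x - pt[0]) <= near and abs(y - pt[1]) <= near
                   for x, y in dets)

    refined = [list(d) for d in detections]
    for k in range(1, len(detections) - 1):
        prev_d, cur, next_d = detections[k - 1], detections[k], detections[k + 1]
        # Rule 1 (adding): both neighbours see a fish near pt, segment k misses it
        for pt in prev_d:
            if has_near(next_d, pt) and not has_near(cur, pt):
                refined[k].append(pt)
        # Rule 2 (removing): neither neighbour supports this detection
        refined[k] = [pt for pt in refined[k]
                      if has_near(prev_d, pt) or has_near(next_d, pt)]
    return refined
```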
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <p>In the fish task test set, 73 video clips are given. We submitted three different runs. In SNUMedinfo1 and SNUMedinfo2, the assignment of the 10 video clips to the training and validation sets is switched (Section 2.2). SNUMedinfo3 is the same as SNUMedinfo1, but the step filtering CNN output within each video segment (Section 2.3) is not applied. The evaluation metrics for this task were the counting score, precision, and the normalized counting score (for a detailed introduction to these metrics, please see the overview paper of this task). The counting score is calculated based on the difference between the number of occurrences in the submitted run and the ground truth. Precision is calculated as the number of true positives divided by the number of true positives plus false positives. The normalized counting score is calculated as the counting score multiplied by precision. Evaluation results on the test set are described in Table 1.</p>
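      <p>Precision and the normalized counting score follow directly from the definitions above; a small sketch follows. The exact counting-score formula is defined in the task overview paper, so the version here is only an illustrative normalized count difference, not the official metric.</p>

```python
def precision(tp, fp):
    """Precision = true positives / (true positives + false positives)."""
    return tp / float(tp + fp)

def counting_score(predicted_count, true_count):
    """Illustrative stand-in for the official counting score, based on the
    difference between submitted and ground-truth occurrence counts."""
    if true_count == 0:
        return 1.0 if predicted_count == 0 else 0.0
    return max(0.0, 1.0 - abs(predicted_count - true_count) / float(true_count))

def normalized_counting_score(cs, prec):
    """Normalized counting score = counting score x precision."""
    return cs * prec
```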
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Compared to other image recognition tasks such as ImageNet or the LifeCLEF Plant task, this task deals with low-quality underwater video. Our experiments therefore involved additional pre-processing and post-processing steps besides deep convolutional neural network training for image recognition. To further analyze the contribution of each step to the final performance, we would need to experiment with various combinations of method options. We postpone a thorough analysis of each step to a future study, when the test set ground truth becomes available.</p>
      <p>
        Generally, however, our overall fish identification performance was very effective despite the challenging conditions of varying underwater video scenes. Our counting score approached 0.9 and our precision exceeded 0.8 (Table 1). Our post-processing step utilizing temporally neighboring segments (Section 2.3) clearly improved performance, as seen when comparing run SNUMedinfo1 (with the temporal post-processing step) to SNUMedinfo3 (without it). Technically, this method is simple compared to more advanced techniques such as (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ), but it was very helpful for improving precision.
      </p>
      <p>This fish task deals with underwater video images, so it was more challenging than general image classification tasks, and additional pre-processing and post-processing steps were needed. We combined a foreground detection method with selective search for candidate fish object window detection. Then a CNN pretrained on a general object classification task was fine-tuned to classify fish species. Outputs from the CNN classification are further refined to produce the final identification results. In future work, we will explore other methodological options to find more effective methods.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cappellato</surname>
            <given-names>L</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            <given-names>N</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            <given-names>G</given-names>
          </string-name>
          , and
          <string-name>
            <surname>San Juan</surname>
            <given-names>E</given-names>
          </string-name>
          .
          <article-title>CLEF 2015 Labs and Workshops</article-title>
          .
          <source>CEUR Workshop Proceedings (CEUR-WS.org)</source>
          ;
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Krizhevsky</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            <given-names>I</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            <given-names>GE</given-names>
          </string-name>
          , editors.
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>Advances in neural information processing systems</source>
          ;
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Szegedy</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            <given-names>W</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sermanet</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            <given-names>D</given-names>
          </string-name>
          , et al.
          <article-title>Going deeper with convolutions</article-title>
          .
          <source>arXiv preprint arXiv:14094842</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Russakovsky</surname>
            <given-names>O</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Su</surname>
            <given-names>H</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krause</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Satheesh</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            <given-names>S</given-names>
          </string-name>
          , et al.
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>arXiv preprint arXiv:14090575</source>
          .
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Uijlings</surname>
            <given-names>JRR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van de Sande</surname>
            <given-names>KEA</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gevers</surname>
            <given-names>T</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smeulders</surname>
            <given-names>AWM</given-names>
          </string-name>
          .
          <article-title>Selective Search for Object Recognition</article-title>
          .
          <source>Int J Comput Vis</source>
          .
          <year>2013</year>
          ;
          <volume>104</volume>
          (
          <issue>2</issue>
          ):
          <fpage>154</fpage>
          -
          <lpage>71</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kae</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marlin</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Learned-Miller</surname>
            <given-names>E</given-names>
          </string-name>
          , editors.
          <article-title>The Shape-Time Random Field for Semantic Video Labeling</article-title>
          .
          <source>Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on</source>
          ; 23-28 June
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>