<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Marine Animal Detection and Recognition with Advanced Deep Learning Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peiqin Zhuang</string-name>
          <email>pq.zhuang@siat.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Linjie Xing</string-name>
          <email>lj.xing@siat.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yanlin Liu</string-name>
          <email>yl.liu@siat.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sheng Guo</string-name>
          <email>sheng.guo@siat.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Qiao</string-name>
          <email>yu.qiao@siat.ac.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences</institution>
          ,
          <country country="CN">P.R. China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes SIATMMLAB's contributions to the SEACLEF2017 task [1]. We took part in three subtasks using advanced deep learning models. In the Automatic Fish Identification and Species Recognition task, we used different frameworks to detect proposal boxes around foreground fish, then fine-tuned a pre-trained neural network to classify the fish. In the Automatic Frame-level Salmon Identification task, we used the BN-Inception [2] network to identify whether a video frame contains salmon. In the Marine Animal Recognition task, we examined different neural networks for classification based on weakly-labelled images. Our methods achieve good results in both task 1 and task 3.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Fish Detection</kwd>
        <kwd>Weakly-labelled</kwd>
        <kwd>Image Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Driven by the increasing demand for ecological surveillance and biodiversity
monitoring underwater, ever more sea-related multimedia data are being collected with the
aid of advanced imaging systems. However, with the exponential growth of
sea-related visual data, it is prohibitive to rely on manual annotation of these
datasets by experts. Automatically analyzing the contents of underwater
images is therefore key to making use of the exponentially increasing underwater data.
SEACLEF2017 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has launched four tasks to explore suitable methods for handling
these multimedia data.
      </p>
      <p>In this paper, we elaborate the methods we applied in SEACLEF2017. The
remainder of this paper is organized as follows. In Section 2, we present our
approach for task 1, including how we propose foreground boxes, reduce
background boxes, and classify the resulting boxes. In Section 3, we report our
approach for frame-level salmon identification in task 2. In Section 4, we describe the
procedure for handling weakly-labelled fish data in task 3. Finally, we provide a
comprehensive analysis of our work and discuss directions for further research.</p>
    </sec>
    <sec id="sec-2">
      <title>Species Recognition on Coral Reef Videos</title>
      <p>
        In this task, we automatically identify and recognize fish species by producing a
bounding box and a corresponding label. To reach this goal, we employed a
traditional pipeline [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3,4,5</xref>
        ] for detecting fish. First, we used different detection
architectures based on deep neural networks to generate potential bounding boxes,
and then fine-tuned a pre-trained neural network on the training data to classify the
bounding boxes. Section 2.1 presents the two methods we used to detect bounding
boxes, while Section 2.2 describes the classification procedure. We show our final
results in Section 2.3. At the end, we discuss how detection performance could be
improved in the future.
      </p>
      <sec id="sec-2-1">
        <title>2.1 Foreground Detection</title>
        <p>
          Inspired by past participants of SEACLEF [
          <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
          ], we separate the species recognition
process into a detection step and a classification step. In the past three years, end-to-end
detection methods [
          <xref ref-type="bibr" rid="ref3 ref4 ref5">3,4,5</xref>
          ] have prevailed in most detection tasks because of their
dramatic performance, so we selected two of the latest models as our detectors.
        </p>
        <p>
          Firstly, we chose SSD [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to differentiate regions of foreground fish from the
background. Thanks to its ability to produce predictions from feature maps
at different scales [
          <xref ref-type="bibr" rid="ref5 ref6">5,6</xref>
          ], SSD gave us acceptable results when generating potential
bounding boxes. We also used another detection architecture, PVANET [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
to detect foreground fish. For the detection step, we extracted all true positive frames with
annotations and then chose one tenth of the positive frames of each video to form the
validation dataset. The remaining positive frames were used for training. A detection
result example is shown in Fig. 1.
Afterwards, to remove false positives, we adopted Sungbin Choi's [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] method:
we computed a background image (Fig. 2) by taking the median value at each pixel
position over every video, then used background subtraction and erosion to create a
mask (Fig. 3) for each frame. If the background area within a bounding box is greater
than a threshold we set, we consider the box background and consequently discard it.
In contrast to other background subtraction methods [
          <xref ref-type="bibr" rid="ref8 ref9">8,9</xref>
          ], we adopted this as a
post-processing step after detection, not directly for detecting bounding boxes.
        </p>
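        <p>The background-removal step above can be sketched in a few lines of Python. This is a minimal illustration using NumPy only; the function names, the erosion radius, and the thresholds are our own illustrative choices, not the exact values used in our experiments.</p>

```python
import numpy as np

def median_background(frames):
    """Per-pixel median over all frames of one video (cf. Fig. 2)."""
    return np.median(np.stack(frames), axis=0)

def erode(mask, k=1):
    """Binary erosion with a (2k+1)x(2k+1) square structuring element."""
    out = mask.copy()
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out &= np.roll(np.roll(mask, dy, axis=0), dx, axis=1)
    return out

def foreground_mask(frame, background, thresh=30):
    """Background subtraction followed by erosion (cf. Fig. 3)."""
    diff = np.abs(frame.astype(int) - background.astype(int))
    return erode(diff > thresh)

def is_background_box(mask, box, bg_ratio=0.8):
    """Discard a detection whose box area is mostly background."""
    x0, y0, x1, y1 = box
    patch = mask[y0:y1, x0:x1]
    return float((~patch).mean()) > bg_ratio
```

        <p>A box covering mostly masked-out pixels would be discarded by <monospace>is_background_box</monospace>, mirroring the threshold test described above.</p>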
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Species Classification</title>
        <p>In the classification step, we extracted true positives from the training dataset using the
annotation information. We also added some background patches as an extra class to
further reduce the number of false positives in our final results. Given the species
imbalance in each video, we pooled the patches from all videos and then separated them
into a training set and a validation set according to their labels. The ratio between
training data and validation data is the same as in the detection step.</p>
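        <p>The per-label split described above can be written as a small helper. This is a sketch under our own assumptions (a fixed random seed and a 1/10 validation ratio mirroring the detection step); the names are hypothetical.</p>

```python
import random
from collections import defaultdict

def stratified_split(samples, val_ratio=0.1, seed=0):
    """Split (patch, label) pairs into train/val sets per class,
    keeping the same 1/10 validation ratio as the detection step."""
    by_label = defaultdict(list)
    for patch, label in samples:
        by_label[label].append(patch)
    rng = random.Random(seed)
    train, val = [], []
    for label, patches in by_label.items():
        rng.shuffle(patches)
        n_val = max(1, int(len(patches) * val_ratio))
        val += [(p, label) for p in patches[:n_val]]
        train += [(p, label) for p in patches[n_val:]]
    return train, val
```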
        <p>
          To attain high classification performance, we chose
ResNet-10 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as our classifier. Based on the relevant information for this task,
we set the number of output classes to 16 and used the training dataset to fine-tune
the ResNet-10 network.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Results</title>
        <p>
          The test dataset contains 73 videos, with scenarios similar to the training dataset. By
applying our detection and classification procedure, we submitted two
results. SIATMMLAB_run1 used the SSD framework to detect foreground fish, while
SIATMMLAB_run2 used PVANET to generate the potential bounding boxes.
Both runs used the ResNet-10 network as the classifier.
The normalized scores of the two results are 0.66 and 0.71 respectively (Table 1), nearly
equal to the best result [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] in 2015.
        </p>
        <sec id="sec-2-3-1">
          <title>SIATMMLAB_run1</title>
        </sec>
        <sec id="sec-2-3-2">
          <title>SIATMMLAB_run2</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Discussion</title>
        <sec id="sec-2-4-1">
          <title>Counting score</title>
        </sec>
        <sec id="sec-2-4-2">
          <title>Precision Normalized score 0.87 0.88</title>
          <p>0.76
0.80
0.66
0.71
Although our approaches achieve a good performance on task 1, there still exist some
aspects need to be further improved. By analyzing the results, we found that several
fish were not detected in some particular video clips , which influences the counting
score and precision. Besides, our detector often mistook some background regions
with abundant texture for foreground fish. Since the video naturally has context
information between frames, it will help us to fix the discontinuity of video sequence
in detection results and achieve a higher performance in the future.</p>
          <p>In task 1, we combined detection and classification methods to recognize fish
species in coral reef videos, and achieved exciting results. In the future,
we will try more combinations of detectors and classifiers, use pre-processing
methods to increase video resolution, reduce the noise caused by
illumination changes, and exploit more video context information. We believe that
effectively incorporating these methods will give ecologists an opportunity to
monitor biodiversity more accurately than manual handling.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Frame-level Salmon Identification</title>
      <p>
        In this task, we need to identify the appearance of salmon in each frame [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since
frames in which salmon appear are rare and the salmon are often small, this
task is challenging. In Section 3.1, we analyze relevant
properties of the training data; Section 3.2 presents the method we used and our
final result. Finally, we discuss task 2 in Section 3.3.
      </p>
      <sec id="sec-3-1">
        <title>3.1 Data Analysis</title>
        <p>The training data contains 8 videos of salmon. Analyzing the frames extracted
from all the videos, it is notable that there are far fewer positives than
negatives: there are nearly 1,300 positives, while there are
almost 59k negatives, a large imbalance between positive
samples and negative samples. Moreover, most frames are filled almost entirely with black
background, meaning the information about salmon in a frame is sparse (Fig. 4).
Some frames even contain a green, static water turbine.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Experiments and Results</title>
        <p>To address the imbalance between positives and negatives, we used all positives
and randomly selected 1,500 negatives to form our training and validation
data. We chose nine tenths of these as the training set, while the rest were used as the
validation set.</p>
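        <p>The sampling scheme above (all positives plus 1,500 randomly drawn negatives, split 9:1) can be sketched as follows; the function and parameter names are illustrative, not from our original code.</p>

```python
import random

def build_salmon_dataset(positives, negatives, n_neg=1500,
                         train_frac=0.9, seed=0):
    """Keep all positives, subsample negatives to counter the
    ~1,300 vs ~59k imbalance, then split nine tenths for training."""
    rng = random.Random(seed)
    data = [(x, 1) for x in positives]
    data += [(x, 0) for x in rng.sample(negatives, n_neg)]
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]
```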
        <p>
          We treated frame-level salmon identification as a binary classification problem,
so we used the BN-Inception [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] network as our classifier. By fine-tuning the
network on the training dataset, we achieved 97% accuracy on our validation dataset.
We also tested the model on the whole dataset: only 2,723 of the
59,956 images were incorrectly classified.
        </p>
        <p>The test dataset contains 8 videos and differs somewhat from the training
dataset. The green water turbine seen in most
frames is constantly revolving, which causes illumination changes and adds considerable
noise. As a result, our model often mistook frames with illumination changes
for positives, creating many false positives. We submitted one run as our final
result. Its precision is 0.04 (Table 2), which means most of our
predicted positives are incorrect: our model's performance declines sharply on the test dataset.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Discussion</title>
        <p>We adopted the BN-Inception network to identify whether a video frame contains
salmon. Although our model reached the goal of this task on the
training dataset, it performed poorly on the test dataset owing to challenging
conditions. First, since the salmon are small and most frames are filled with
black background, it is difficult to choose frames that genuinely
represent the properties of positives and negatives. Second, our model mistook
many negatives for positives because of the illumination changes caused by the
revolving water turbine.</p>
        <p>Given these constraints, more work is needed to improve the model's
performance. Strengthening the feature representation and introducing data-mining tricks
may further address the frame-selection problem. Moreover, more pre-processing
techniques could be employed to improve frame quality and reduce the influence
of illumination changes. Along these two lines, we will incorporate
more methods to increase our model's generalization and accuracy in the future.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Marine Animal Species Recognition</title>
      <p>
        Task 3 aims to classify marine animals in weakly-labelled images collected by
keyword queries on the internet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Owing to the high similarity between
species and the weak labels, exact recognition is
challenging. Section 4.1 analyzes the training dataset and its
difficulties. Section 4.2 gives the details of our experiments and final results. Finally,
we briefly discuss our work in Section 4.3.
      </p>
      <sec id="sec-4-1">
        <title>4.1 Data Analysis</title>
        <p>The training dataset includes nearly 13k images, but some species have far
fewer images than others. During our experiments, we also found other flaws
in the training dataset: some maps describing the distribution regions of the
corresponding species were included in the training dataset, and
sometimes two identical images appeared under different species, which often
caused classification mistakes. We therefore briefly examined the training
dataset and picked out the bad data. We also added nearly 2k images extracted
from YouTube videos to increase the generalization of our model. We split all the
data into training data and validation data, selecting 10 images per species as the
validation set.</p>
        <table-wrap id="tab3">
          <label>Table 3.</label>
          <caption>
            <p>Settings of our five experiments and their top-1 accuracy on the validation set.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Experiment</th>
                <th>Model</th>
                <th>Input size</th>
                <th>Crop size</th>
                <th>Top-1 accuracy</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Experiment1</td><td>BN-Inception</td><td>360×260</td><td>224</td><td>0.840</td></tr>
              <tr><td>Experiment2</td><td>BN-Inception</td><td>600×400</td><td>224</td><td>0.818</td></tr>
              <tr><td>Experiment3</td><td>BN-Inception</td><td>600×400</td><td>336</td><td>0.845</td></tr>
              <tr><td>Experiment4</td><td>ResNet-50</td><td>360×260</td><td>224</td><td>0.837</td></tr>
              <tr><td>Experiment5</td><td>ResNet-50</td><td>600×400</td><td>224</td><td>0.800</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Experiments and Results</title>
        <p>
          Motivated by the high performance of deep neural networks [
          <xref ref-type="bibr" rid="ref10 ref11 ref12 ref2">2,10,11,12</xref>
          ], we decided to
use two recent network architectures: BN-Inception [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and ResNet-50 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
We ran a number of experiments on the training data and evaluated performance on the
validation set with the top-1 and top-5 metrics. The details are shown in
Table 3.
        </p>
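        <p>The top-1 and top-5 accuracies used above can be computed as follows. This is a minimal NumPy sketch with illustrative names: <monospace>scores</monospace> is assumed to be an N×C matrix of class scores and <monospace>labels</monospace> the ground-truth class indices.</p>

```python
import numpy as np

def topk_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k
    highest-scoring classes (k=1 gives top-1, k=5 gives top-5)."""
    ranked = np.argsort(scores, axis=1)[:, ::-1][:, :k]
    hits = [label in row for row, label in zip(ranked, labels)]
    return float(np.mean(hits))
```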
        <p>
          We ran five such experiments on the training dataset, varying the deep neural
network model [
          <xref ref-type="bibr" rid="ref10 ref2">2,10</xref>
          ], the input image size, and the crop size [
          <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
          ]
used to augment the training data. Despite the challenges of this task, all our
experiments achieved high performance on the validation data: the accuracies
under the top-1 metric were all at least 0.8, and those
under the top-5 metric were all higher than 0.93.
        </p>
        <p>
          We submitted three runs. Based on the source of the test images, we
separated the test dataset into two parts. Images cropped from videos, which have
scenarios similar to task 1's data, were classified by a new ResNet-50 network trained
on task 1's video data. The rest of the test data were classified by fusing the outputs of the
models [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] described above. SIATMMLAB_run1 fused
experiment 1 and experiment 4. Fusing experiment 1, experiment 3 and
experiment 4 produced the result of SIATMMLAB_run2. SIATMMLAB_run3
combined the results of experiment 1 and experiment 3. The metric used for the
final runs is average precision; the results of our runs are shown in Table 4.
        </p>
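        <p>Average precision, the metric used for the final runs, can be computed from a ranked list of relevance judgements. A minimal sketch, assuming the ranking comes from the classifier's confidence scores:</p>

```python
def average_precision(ranked_relevance):
    """AP over a list of 0/1 relevance flags ordered by decreasing
    classifier confidence: the mean of the precision values at each
    rank where a relevant item occurs."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0
```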
      </sec>
      <sec id="sec-4-3">
        <title>4.3 Discussion</title>
        <p>Although our model achieved over 0.8 top-1 accuracy on the
validation dataset, its performance decreased somewhat on the test dataset.
Reviewing our whole experimental procedure, we did not use the relevance
ranking information when training our models. Since the images are
weakly-labelled and the evaluation metric is average precision, using the
relevance information may help us recognize the species more accurately in the
future. Furthermore, some species are highly similar to each other; even
humans cannot easily distinguish them without professional knowledge. Finding
an effective method to truly represent these species is a key problem we
want to solve in future work. Besides, some fish are very small compared
to the whole image; here we will employ saliency maps to first locate the
fish before the recognition step.</p>
        <p>Automatically recognizing fish species is essential for handling sea-related
multimedia data. We hope to further improve our classifier with more advanced model
fusion and auxiliary methods, so that fish species can be recognized more accurately when
ecologists use these methods to monitor biodiversity.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Perspectives</title>
      <p>This paper has described our participation in SEACLEF2017. All our approaches are
based on deep neural network architectures, and they achieved high performance
in both task 1 and task 3. Although deep-neural-network methods proved
effective in this competition, some constraints remain, such as low resolution,
illumination changes and complicated backgrounds, when handling underwater
multimedia data. We will investigate more effective methods to address each of
these aspects and try to raise our performance to a new
level. Further work will improve our models' performance and
help bring these methods into real-world applications.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Joly</surname>
          </string-name>
          , Hervé Goëau, Hervé Glotin, Concetto Spampinato, Pierre Bonnet, Willem-Pier Vellinga, Jean-Christophe Lombardo, Robert Planqué, Simone Palazzo, Henning Müller.
          <article-title>LifeCLEF 2017 Lab Overview: multimedia species identification challenges</article-title>
          .
          <source>Proceedings of CLEF 2017</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Ioffe</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Szegedy</surname>
          </string-name>
          .
          <article-title>Batch normalization: Accelerating deep network training by reducing internal covariate shift</article-title>
          .
          <source>In ICML</source>
          , pages
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          .
          <article-title>Fast R-CNN</article-title>
          .
          <source>In ICCV</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He, Ross Girshick, Jian Sun.
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg.
          <article-title>SSD: Single shot multibox detector</article-title>
          .
          <source>In ECCV</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Tsung-Yi</given-names>
            <surname>Lin</surname>
          </string-name>
          , Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie.
          <article-title>Feature Pyramid Networks for Object Detection</article-title>
          .
          <source>arXiv preprint arXiv:1612.03144</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Kye-Hyeon</given-names>
            <surname>Kim</surname>
          </string-name>
          , Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, Minje Park.
          <article-title>PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection</article-title>
          .
          <source>arXiv preprint arXiv:1611.08588</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Fish identification in underwater video with deep convolutional neural network: SNUMedinfo at LifeCLEF fish task 2015</article-title>
          .
          <source>In Working Notes of the 6th International Conference of the CLEF Initiative, CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          , Vol-
          <volume>1391</volume>
          , urn:nbn:de:0074-1391-8.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Jonas</given-names>
            <surname>Jäger</surname>
          </string-name>
          , Erik Rodner, Joachim Denzler, Viviane Wolff, Klaus Fricke-Neuderth.
          <article-title>SeaCLEF 2016: Object Proposal Classification for Fish Detection in Underwater Videos</article-title>
          .
          <source>In Working Notes of the 7th International Conference of the CLEF Initiative, CEUR Workshop Proceedings</source>
          ,
          <year>2016</year>
          , Vol-
          <volume>1609</volume>
          , urn:nbn:de:0074-1609-5.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Kaiming</given-names>
            <surname>He</surname>
          </string-name>
          , Xiangyu Zhang, Shaoqing Ren, Jian Sun.
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In CVPR</source>
          , pages
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, Geoffrey Hinton.
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          <source>In NIPS</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Wei</given-names>
            <surname>Liu</surname>
          </string-name>
          , Andrew Rabinovich, Alexander C. Berg.
          <article-title>ParseNet: Looking wider to see better</article-title>
          .
          <source>In ICLR</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Limin</given-names>
            <surname>Wang</surname>
          </string-name>
          , Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao.
          <article-title>Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>26</volume>
          (
          <issue>4</issue>
          ),
          <fpage>2055</fpage>
          -
          <lpage>2068</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Sheng</given-names>
            <surname>Guo</surname>
          </string-name>
          , Weilin Huang, Limin Wang, Yu Qiao.
          <article-title>Locally-Supervised Deep Hybrid Model for Scene Recognition</article-title>
          .
          <source>IEEE Transactions on Image Processing</source>
          <volume>26</volume>
          (
          <issue>2</issue>
          ),
          <fpage>808</fpage>
          -
          <lpage>820</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>