<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SeaCLEF 2016: Object proposal classification for fish detection in underwater videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jonas Jäger</string-name>
          <email>Jonas.Jaeger@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Rodner</string-name>
          <email>Erik.Rodner@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joachim Denzler</string-name>
          <email>Joachim.Denzler@uni-jena.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviane Wolff</string-name>
          <email>Viviane.Wolff@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Fricke-Neuderth</string-name>
          <email>Klaus.Fricke-Neuderth@et.hs-fulda.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision Group, Friedrich Schiller University Jena</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering and Information Technology, Fulda University of Applied Sciences</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This working note describes the results of the CVG Jena Fulda team for the fish recognition task in SeaCLEF 2016. Our method is based on convolutional neural networks applied to object proposals, both for detection and for species classification. We use background subtraction proposals that are filtered by a binary SVM classifier for fish detection and a multiclass SVM for species classification. Both SVMs utilize CNN features extracted from AlexNet. With this pipeline we achieve a recognition precision of 66% and a normalized counting score of 58% on the provided test dataset. We also show that classification of background subtraction proposals works much better for fish detection than background subtraction on its own.</p>
      </abstract>
      <kwd-group>
        <kwd>Object proposals</kwd>
        <kwd>R-CNN</kwd>
        <kwd>CNN features</kwd>
        <kwd>Fine-grained classification</kwd>
        <kwd>Fish detection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents the participation of the CVG Jena Fulda team in SeaCLEF 2016 Task 1. The task deals with automatic recognition of coral reef fish species in low-resolution videos. All fish are shown in their natural, unrestricted habitat. See Fig. 1 for example frames.</p>
      <p>This task is important for advancing computer vision methods for biodiversity applications. Many scientists in the field of ecology collect large amounts of video data to monitor biodiversity in their specific applications. But manual analysis of this data is time consuming and requires the knowledge of rare human experts, which makes it impossible to evaluate data at a large scale. However, such large-scale analysis is essential to obtain the knowledge needed to save ecosystems that have a large impact on the human population. Therefore, tools for automatic video analysis need to be developed to support the work of ecologists.</p>
      <p>
        We have a special interest in this task because our team works on a closely related problem [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In our application we deal with high-resolution underwater video analysis of fish species in the Adriatic Sea in Croatia.
      </p>
      <p>We noticed that detection is a crucial part of a fish classification and counting system. But we also experienced that fish detection is a difficult problem due to lighting changes and the complex background of a natural environment. Therefore, we focus on robust fish detection in the remainder of this paper.</p>
      <p>
        Last year's participants [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] in this task used median-image background subtraction for fish detection. Boom et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] also utilized background subtraction methods and post-processed detection results with an objectness filter to remove bad detections. In contrast, we classify fish proposals by CNN features [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this work we propose the use of object proposal classification for fish detection. Object proposals are obtained by background subtraction and then classified into fish and background by a binary support vector machine (SVM). For fish recognition we utilize a multiclass SVM trained for the 15 considered species. Both SVMs use the same CNN features, extracted from AlexNet [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], for prediction.
      </p>
      <p>
        Our detection approach is very similar to the idea of region-based convolutional neural networks (R-CNN) presented by Girshick et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In contrast to [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we use the background subtraction method of Stauffer and Grimson [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] instead of selective search [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] for proposal generation, since we can exploit temporal information in the video data. Another difference is that we do not apply domain-specific fine-tuning to the CNN.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Fish Dataset</title>
      <p>The provided dataset: The dataset contains videos and images of fish species in their natural coral reef environment. It is divided into a training set and a test set. Example frames from six different videos are shown in Fig. 1.</p>
      <p>The provided training set consists of 20 low-resolution videos and more than 20,000 sample images of 15 fish species. There are 5 videos with a resolution of 640 × 480 pixels and 15 videos with 320 × 240 pixels. All videos are annotated by two human experts with bounding boxes and species names.</p>
      <p>The test set contains 73 videos with a resolution of 320 × 240 pixels.
Dataset preparation: We split the given training videos into two parts of 10 videos each. One part is used as a validation set. The other 10 videos and all sample images are used for training and are called the training set in the rest of this paper.</p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Overview</title>
        <p>Our main idea is to build a fish detector and to use its detections for species classification. Since the application of background subtraction methods on their own leads to a large number of false detections, we use background subtraction to get fish proposals and classify each proposal as fish or background. Then, all fish detections are classified as one of 15 species or rejected. Both classifiers, for detection and species recognition, use the same features.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Object proposal classification for fish detection</title>
        <p>Our fish detection approach consists of three steps: (1) generation of bounding box proposals, (2) extraction of CNN features for each proposal, and (3) classification of each bounding box proposal as fish or background. See Fig. 2 for an illustration of these steps.</p>
        <p>In step (1) we use the background subtraction algorithm of Stauffer and Grimson [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which uses a probabilistic background model that represents each pixel as a mixture of Gaussians. The result of this algorithm is a binary mask that indicates which pixels are background (see Fig. 2a). This mask is further used to obtain a second background mask (see Fig. 2b) by applying an erosion filter to it, which allows us to separate nearby fish.</p>
        <p>After that we apply the blob detection method of Suzuki and Abe [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to both masks to get bounding box proposals (see Fig. 2c). Bounding boxes with an area smaller than 100 pixels are removed, since such proposals are too small for species classification.</p>
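        <p>To make step (1) concrete, the following is a minimal Python/OpenCV sketch, not our exact implementation: OpenCV's MOG2 subtractor serves as a Gaussian-mixture stand-in for the Stauffer-Grimson model (its hyperparameters below are assumptions), cv2.findContours implements the border-following blob detection of Suzuki and Abe, and proposals are collected from both the raw and the eroded mask.</p>
        <preformat>
import cv2

# GMM-based background subtractor (stand-in for Stauffer-Grimson;
# parameters are illustrative, not the values used in our experiments)
bgs = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                         detectShadows=False)
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

def fish_proposals(frame):
    fg = bgs.apply(frame)            # binary foreground mask (Fig. 2a)
    eroded = cv2.erode(fg, kernel)   # eroded mask separates nearby fish (Fig. 2b)
    boxes = []
    for mask in (fg, eroded):
        # Suzuki-Abe border following; OpenCV 4.x returns (contours, hierarchy)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w * h >= 100:         # drop proposals smaller than 100 px
                boxes.append((x, y, x + w, y + h))
    return boxes
        </preformat>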
        <p>In step (2) we use the generated proposals to extract CNN features from AlexNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which was pretrained on ILSVRC 2012 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. As features we choose the activations of the 7th hidden layer (relu7) of the convolutional network. Note that we did not fine-tune the convolutional net by training it with fish images.</p>
        <p>[Fig. 2: (a) background subtraction mask, (b) eroded mask, (c) object proposals, (d) boxes that were classified as fish]</p>
        <p>In step (3), based on these features, we utilize a binary SVM to classify each bounding box proposal as fish or background (see Fig. 2d). We then keep all fish detections whose confidence level is greater than or equal to 0.5. In order to obtain a probability measure from SVM scores we use Platt scaling [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as implemented in scikit-learn [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].</p>
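        <p>A compact sketch of steps (2) and (3) follows, using torchvision's ImageNet-pretrained AlexNet as a stand-in for the original model; function and variable names, input size, and normalization constants are assumptions. In scikit-learn, SVC with probability=True applies exactly the Platt scaling mentioned above for binary problems.</p>
        <preformat>
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# AlexNet pretrained on ILSVRC 2012; we read out the activations of the
# ReLU that follows fc7, i.e. the 7th hidden layer ("relu7").
alexnet = models.alexnet(weights="IMAGENET1K_V1").eval()
relu7_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:6])

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def relu7_features(crop_rgb):
    # 4096-dim relu7 activation for one proposal crop (RGB uint8 array)
    x = preprocess(crop_rgb).unsqueeze(0)
    with torch.no_grad():
        conv = alexnet.avgpool(alexnet.features(x)).flatten(1)
        return relu7_head(conv).squeeze(0).numpy()

# binary fish/background SVM; probability=True turns on Platt scaling
fish_svm = SVC(kernel="linear", probability=True)
# fish_svm.fit(train_features, train_labels)       # 1 = fish, 0 = background
# p_fish = fish_svm.predict_proba(features)[:, 1]  # keep boxes with p_fish >= 0.5
        </preformat>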
        <p>As a post-processing step we apply non-maximum suppression to remove duplicate boxes for all fish.</p>
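        <p>Non-maximum suppression can be realized as a simple greedy loop over the score-sorted detections; the overlap threshold in this sketch is an assumption, as we do not report the exact value here.</p>
        <preformat>
def iou(a, b):
    # intersection over union of two boxes (x1, y1, x2, y2)
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def nms(boxes, scores, overlap_thr=0.3):
    # keep the highest-scoring box of each group of overlapping boxes
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if not any(iou(boxes[i], boxes[j]) > overlap_thr for j in keep):
            keep.append(i)
    return [boxes[i] for i in keep]
        </preformat>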
        <p>Detector training: To train our detector we extract CNN features (see step (2)) and fit the SVM classifier to the classes background and fish. As training data we utilize the fish sample images of the training set (see section 2) and extract all annotated fish from the 10 training videos. As background examples we generate object proposals from the training set videos and extract those boxes that have no intersection with a ground-truth fish box.</p>
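        <p>Selecting background examples then reduces to an overlap test against the annotations; the names in this fragment are illustrative.</p>
        <preformat>
def intersects(a, b):
    # True if two axis-aligned boxes (x1, y1, x2, y2) share any area
    return a[2] > b[0] and b[2] > a[0] and a[3] > b[1] and b[3] > a[1]

def background_examples(proposals, gt_boxes):
    # background examples: proposals that touch no ground-truth fish box
    return [p for p in proposals
            if not any(intersects(p, g) for g in gt_boxes)]
        </preformat>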
      </sec>
      <sec id="sec-3-3">
        <title>Species classification using CNN features</title>
        <p>
          As in our previous work [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], we use CNN features [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and a multiclass SVM for species prediction. We utilize the same CNN features that were extracted in detection step (2) from AlexNet [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which was pretrained on ImageNet. As features we choose the activations of the 7th hidden layer (relu7) of the network.
        </p>
        <p>When the confidence level for a classification is lower than 0.5 we consider it an unknown fish and reject it. In order to get probabilities from SVM scores we use the method of Wu et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].</p>
        <p>The SVM is trained with a one-vs-rest strategy for the 15 considered fish species. The training data are composed of the provided species sample images and all annotated fish cropped from the training set videos (see section 2).</p>
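        <p>As a sketch, the species classifier can be expressed in scikit-learn on top of the relu7_features helper from above. Note that SVC with probability=True derives its multiclass probabilities with the pairwise-coupling method of Wu et al. [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]; a strictly one-vs-rest setup would instead calibrate one binary SVM per species. The kernel choice and names are assumptions.</p>
        <preformat>
import numpy as np
from sklearn.svm import SVC

# multiclass species SVM; probability=True makes scikit-learn compute
# class probabilities via the pairwise coupling of Wu et al.
species_svm = SVC(kernel="linear", probability=True)
# species_svm.fit(train_features, species_labels)   # 15 species

def classify_detection(features):
    proba = species_svm.predict_proba([features])[0]
    best = int(np.argmax(proba))
    if proba[best] >= 0.5:
        return species_svm.classes_[best]
    return None    # confidence below 0.5: rejected as unknown fish
        </preformat>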
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Fish detection results</title>
        <p>
          One of our main interests is how well object proposal classification (OPC) works for fish detection compared to background subtraction on its own. For that purpose we first describe the methods listed in Tab. 1 and then define our evaluation process in the style of Pascal VOC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Finally, we discuss our fish detection results presented in Fig. 3 and Tab. 1.
Methods: The first method, called BgsMedian in Tab. 1, computes a median background image from all frames of a video and subtracts the current frame from that background image. A specific pixel of the median image is calculated as the median value of all pixels at the same position in the video. This method was also used by last year's participants [
          <xref ref-type="bibr" rid="ref2 ref3">3, 2</xref>
          ].
        </p>
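        <p>The BgsMedian baseline amounts to a few lines of NumPy; the difference threshold in this sketch is an assumption, as we do not report the exact value.</p>
        <preformat>
import numpy as np

def median_background(frames):
    # per-pixel median over all frames of a video (N x H x W x 3, uint8)
    return np.median(np.stack(frames), axis=0).astype(np.uint8)

def foreground_mask(frame, background, thresh=30):
    # binary mask of pixels deviating strongly from the median background
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16)).max(axis=2)
    return np.where(diff >= thresh, 255, 0).astype(np.uint8)
        </preformat>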
        <p>The second method, referenced as BgsGMM, was developed by Stauffer and Grimson [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and uses a probabilistic background model that represents each pixel as a mixture of Gaussians.</p>
        <p>To obtain bounding boxes from these background subtraction methods we applied the blob detection method proposed by Suzuki and Abe [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].</p>
        <p>OPC (BgsMedian) and OPC (BgsGMM) use the pipeline described in section 3.2, with the exception that BgsMedian is used for bounding box proposal generation in OPC (BgsMedian).</p>
        <p>In our experiments we fine-tuned the parameters of the background subtraction methods for fish detection when used on their own. When we used these methods for proposal generation, the parameters were adjusted to obtain many fish proposals.</p>
        <p>
          Evaluation process: As in Pascal VOC [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], we consider a fish detection as correct (true positive) if the intersection over union ratio (IoU) of a ground-truth box with a predicted box is greater than or equal to 0.5. If more than one predicted box satisfies this condition for a specific ground-truth box, then one predicted box is counted as a true positive and the remaining boxes are counted as false positives.
        </p>
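        <p>The matching rule can be sketched as a greedy assignment over the detections, reusing the iou helper from section 3.2; processing detections in descending confidence order is an assumption in line with the Pascal VOC protocol.</p>
        <preformat>
def match_detections(gt_boxes, detections, iou_thr=0.5):
    # detections are assumed sorted by descending confidence
    matched, tp, fp = set(), 0, 0
    for det in detections:
        best, best_iou = None, iou_thr
        for i, gt in enumerate(gt_boxes):
            if i not in matched and iou(det, gt) >= best_iou:
                best, best_iou = i, iou(det, gt)
        if best is None:
            fp += 1        # unmatched or duplicate detection
        else:
            tp += 1        # first sufficient match per ground-truth box
            matched.add(best)
    return tp, fp
        </preformat>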
        <p>Discussion: Fig. 3 and Tab. 1 present detection results for the above-mentioned methods. The OPC (BgsGMM) approach works best in our setup. For detection by background subtraction alone, BgsGMM has a higher average precision score than BgsMedian, yet the average precision of BgsGMM is still 36.35% lower than that of OPC (BgsGMM).</p>
        <p>In general it can be observed that the OPC detection approaches work better than background subtraction in our setup, although the CNN was not fine-tuned to fish images.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Species classification</title>
        <p>For species classification we used the detections of OPC (BgsGMM) and extracted CNN features to classify each detection as one of the 15 considered fish species. If the confidence level for a classification was lower than 0.5, it was rejected.</p>
        <p>With this pipeline we achieve a counting score (CS) of 83%, a precision of 66%, and a normalized counting score (NCS) of 58% (see Fig. 4). CS and NCS are used as scoring functions in SeaCLEF 2016 and are defined as CS = e<sup>-d/N<sub>gt</sub></sup> (1), with d as the difference between the number of ground-truth occurrences N<sub>gt</sub> and the predicted occurrences per species, and NCS = CS × precision (2).</p>
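        <p>For concreteness, the two scores of Eqs. (1) and (2) in Python; taking the absolute value of the count difference d is an assumption.</p>
        <preformat>
import math

def counting_score(n_gt, n_pred):
    # Eq. (1): CS = exp(-d / N_gt), d = per-species count difference
    d = abs(n_gt - n_pred)
    return math.exp(-d / n_gt)

def normalized_counting_score(cs, precision):
    # Eq. (2): NCS = CS * precision
    return cs * precision
        </preformat>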
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>This paper described our participation in the SeaCLEF 2016 fish species recognition task. We focused on robust fish detection, since the simple application of background subtraction methods leads to a large number of false detections. We therefore compared traditional background subtraction methods, mainly used for fish detection so far, with object proposal classification (OPC) for detection. We show that OPC fish detection (Fig. 2) works much better than background subtraction (Fig. 3) in our setup.</p>
      <p>For species recognition we use the same CNN features as for detection and classify each fish with a multiclass SVM. Using this pipeline we achieve a normalized counting score of 58% and a precision of 66% (see Fig. 4) on the provided test dataset.</p>
      <p>For the future we plan to incorporate fish tracking. We also want to use larger CNN models and fine-tune these models to fish data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bastiaan</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boom</surname>
          </string-name>
          , Jiyin He, Simone Palazzo, Phoenix X. Huang, Cigdem Beyan,
          <string-name>
            <surname>Hsiu-Mei</surname>
            <given-names>Chou</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fang-Pang</surname>
            <given-names>Lin</given-names>
          </string-name>
          , Concetto
          <string-name>
            <surname>Spampinato</surname>
          </string-name>
          , and
          <string-name>
            <surname>Robert</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Fisher</surname>
          </string-name>
          .
          <article-title>A research tool for long-term and continuous analysis of sh assemblage in coral-reefs using underwater camera footage</article-title>
          .
          <source>Ecological Informatics</source>
          ,
          <volume>23</volume>
          :
          <fpage>83</fpage>
          {
          <fpage>97</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Jorge</given-names>
            <surname>Cabrera-Gmez</surname>
          </string-name>
          , Modesto Castrilln Santana, Antonio Domnguez-Brito, Daniel Hernandez-Sosa,
          <article-title>Josep Isern-Gonzlez, and Javier Lorenzo-Navarro. Exploring the use of local descriptors for sh recognition in lifeclef 2015</article-title>
          .
          <source>In Working Notes of the 6th International Conference of the CLEF Initiative. CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          . Vol-
          <volume>1391</volume>
          , urn:nbn:de:
          <fpage>0074</fpage>
          -
          <lpage>1391</lpage>
          -8.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Sungbin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Fish identi cation in underwater video with deep convolutional neural network: Snumedinfo at lifeclef sh task 2015</article-title>
          .
          <source>In Working Notes of the 6th International Conference of the CLEF Initiative. CEUR Workshop Proceedings</source>
          ,
          <year>2015</year>
          . Vol-
          <volume>1391</volume>
          , urn:nbn:de:
          <fpage>0074</fpage>
          -
          <lpage>1391</lpage>
          -8.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Je</given-names>
            <surname>Donahue</surname>
          </string-name>
          , Yangqing Jia, Oriol Vinyals, Judy Ho man, Ning Zhang, Eric Tzeng, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Darrell</surname>
          </string-name>
          .
          <article-title>Decaf: A deep convolutional activation feature for generic visual recognition</article-title>
          .
          <source>CoRR, abs/1310.1531</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mark</given-names>
            <surname>Everingham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Ali Eslami</surname>
          </string-name>
          , Luc Van Gool,
          <string-name>
            <surname>Christopher</surname>
            <given-names>K. I. Williams</given-names>
          </string-name>
          , John Winn, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>The pascal visual object classes challenge: A retrospective</article-title>
          .
          <source>International Journal of Computer Vision</source>
          ,
          <volume>111</volume>
          (
          <issue>1</issue>
          ):
          <volume>98</volume>
          {
          <fpage>136</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ross</surname>
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Girshick</surname>
            , Je Donahue, Trevor Darrell, and
            <given-names>Jitendra</given-names>
          </string-name>
          <string-name>
            <surname>Malik</surname>
          </string-name>
          .
          <article-title>Rich feature hierarchies for accurate object detection and semantic segmentation</article-title>
          .
          <source>CoRR, abs/1311.2524</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Jonas Jager, Marcel Simon, Joachim Denzler, Viviane Wol , Klaus FrickeNeuderth, and Claudia Kruschel.
          <article-title>Croatian sh dataset: Fine-grained classi cation of sh species in their natural habitat</article-title>
          . In T.
          <string-name>
            <surname>Pltz S. McKenna T. Amaral</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Matthews</surname>
          </string-name>
          and R. Fisher, editors,
          <source>Proceedings of the Machine Vision of Animals and their Behaviour (MVAB)</source>
          , pages
          <fpage>6</fpage>
          .1{
          <issue>6</issue>
          .7. BMVA Press,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>