Exploring the use of local descriptors for fish recognition in LifeCLEF 2015

Jorge Cabrera-Gámez, Modesto Castrillón-Santana, Antonio Domínguez-Brito, Daniel Hernández-Sosa, Josep Isern-González, and Javier Lorenzo-Navarro

SIANI, Universidad de Las Palmas de Gran Canaria, Spain
jcabrera@iusiani.ulpgc.es
http://berlioz.dis.ulpgc.es/roc-siani

Abstract. This paper summarizes the proposal made by the SIANI team for the LifeCLEF 2015 Fish task. The approach makes use of standard detection techniques, applying a multiclass SVM based classifier on sufficiently large Regions Of Interest (ROIs) automatically extracted from the provided video frames. The selection of the detection and classification modules is based on the best performance achieved on the validation dataset, which consists of 20 annotated videos. For that dataset, the best classification achieved with an ideal detection module reaches an accuracy of around 40%.

Keywords: Local descriptors, score level fusion, SVM based classification

1 Introduction

Underwater monitoring is a required ability in different application scenarios, such as biological, fisheries, geological and physical surveys. The ever growing availability of media captured in this environment poses the challenge of extracting useful data automatically. This is indeed a hard scenario, where effective techniques are needed to reduce costs and human exposure. With this aim, CLEF presented LifeCLEF for the first time in 2014: the labs dedicated to multimedia life species identification [10], including FishCLEF, a video-based fish identification task. The short-term goal was simply to automatically detect any fish and identify its species. The medium-term goal is to provide researchers with tools to automatically monitor species with high accuracy, in order to extract information about living species for sustainable development and biodiversity conservation.

This year the labs [3] and the task have been run again [11, 14]. Participants could initially access the training data, to later submit labels for the test set. The task to be accomplished was "count fish per species in video segments". This paper describes the approach adopted by the SIANI team. The following sections detail the different integrated elements, which basically perform first a detection and then identify the fish species in the cropped image.

2 The approach

As succinctly mentioned above, the fish identification task has been decomposed into two phases: detection and classification.

2.1 Detection

The goal of the detection phase is to reduce the search area by extracting candidate ROIs from the video stream. Three different foreground detection approaches have been tested: fast, histogram backprojection and Gaussian Mixture Modeling (GMM).

Fast. This approach makes use of a simple and fast background model computed from the video frames, which is robust enough for the detection and extraction problem in some scenarios. This background modeling solution takes advantage of the static camera configuration in this particular scenario. To define the scene background model, bg, we have used a method similar to commonly used techniques such as the mean filter or the median filter [13]: we compute the mode image, i.e. the most frequent value of each RGB component of each pixel along the video frames.

Once the background model is available, simple and fast background subtraction may be applied to each RGB component of the input image, I. The foreground is computed based on a threshold applied to the sum of squares of the RGB components of the difference image (D_R, D_G, D_B):

  S(i, j) = D_R(i, j)^2 + D_G(i, j)^2 + D_B(i, j)^2    (1)

where D_x = I_x - bg_x for every RGB component (x = R, G, B). For a pixel of a given image, I(i, j), its corresponding pixel in the foreground image, fg, is computed as

  fg(i, j) = \begin{cases} I(i, j) & \text{if } S(i, j) \geq \tau_1 \\ 0 & \text{otherwise} \end{cases}    (2)

The threshold \tau_1 in equation (2) has been determined by the application scenario. The extracted pixels are considered foreground, i.e. regions of interest for the detection problem.
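For illustration, the following is a minimal sketch of this detector, assuming NumPy is available; the function names and the value of tau1 are illustrative choices for the example, not the ones used in our submission.

```python
import numpy as np

def mode_background(frames):
    """Per-pixel, per-channel mode over a stack of uint8 RGB frames, shape (N, H, W, 3)."""
    stack = np.stack(frames).astype(np.uint8)
    n, h, w, c = stack.shape
    flat = stack.reshape(n, -1)                       # (N, H*W*3)
    bg = np.empty(flat.shape[1], dtype=np.uint8)
    for i in range(flat.shape[1]):                    # simple but slow: one histogram per pixel/channel
        bg[i] = np.bincount(flat[:, i], minlength=256).argmax()
    return bg.reshape(h, w, c)

def fast_foreground(image, bg, tau1=2500.0):
    """Equations (1)-(2): keep pixels whose squared RGB distance to bg is at least tau1."""
    d = image.astype(np.float32) - bg.astype(np.float32)
    s = np.sum(d ** 2, axis=2)                        # S(i, j)
    mask = s >= tau1                                  # tau1 is an illustrative value
    fg = np.where(mask[..., None], image, 0)          # foreground image, zero elsewhere
    return fg.astype(image.dtype), mask
```

Candidate ROIs can then be obtained, for instance, as bounding boxes of the connected components of the returned mask.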
Histogram Backprojection (BackProj). The second detection method evaluated is inspired by the idea proposed in [16], which we have adapted to background segmentation. The method is based on the backprojection of temporal color histograms, and comprises the following steps:

1. Calculate, for each color component, the temporal histogram of every image pixel: h_x = hist_t(I_x(i, j)).
2. Add to each histogram bin, k, the values of its neighborhood, ±s (convolution mask): c_x(k) = \sum_{l=k-s}^{k+s} h_x(l).
3. Normalize the resulting histogram: h_x(k) = c_x(k) / max(c_x).
4. Backproject the histogram on every image: P_x(i, j) = h_x(I_x(i, j)).
5. Sum the squares of the values of each component of the pixel: S(i, j) = P_R(i, j)^2 + P_G(i, j)^2 + P_B(i, j)^2.
6. Use a threshold to separate the foreground from the background:

  fg(i, j) = \begin{cases} I(i, j) & \text{if } S(i, j) < \tau_2 \\ 0 & \text{otherwise} \end{cases}    (3)

Also in this case, the threshold \tau_2 in equation (3) has been determined by the application scenario.

GMM Based Background Modeling (GMM). The third background subtraction method analyzed is the one proposed by Zivkovic and van der Heijden [20]. This method performs a pixel-level background subtraction, modeling each background pixel with a GMM and extending the method proposed by Stauffer and Grimson [15]. Thus the background model is defined as:

  p(x \mid X_T, bg) \approx \sum_{m=1}^{B} \hat{\pi}_m \, \mathcal{N}(x; \hat{\mu}_m, \hat{\sigma}_m^2 I_d)    (4)

where X_T = {x^{(t)}, ..., x^{(t-T)}} is the training set for the time period T, \hat{\mu}_1, ..., \hat{\mu}_B are the estimates of the means, \hat{\sigma}_1, ..., \hat{\sigma}_B are the estimates of the variances, and I_d is the identity matrix. B is the number of components, weighted by \hat{\pi}_m. An optimization process was launched over the training videos to find a suitable configuration for the GMM foreground detection algorithm, including the number of distributions, the background ratio, the number of training frames and the learning rate.

2.2 Classification

Detected ROIs are fed to the classification phase to identify the fish species. The classification phase has been designed based on local descriptors, which are currently well-known techniques in different Computer Vision (CV) problems. In texture analysis, an image is described in terms of local descriptor codes by means of a histogram, h_i, whose bins contain the number of occurrences of the different descriptor codes present in the image. This approach follows a Bag of Words scheme [6].

For some problems, the use of a single histogram may imply a loss of spatial information. To avoid this effect, a grid of cells is used, defined by the number of horizontal and vertical cells, respectively cx and cy, making a total of cx × cy cells on the analyzed pattern. Once the grid setup is defined, for a particular descriptor, d, the resulting feature vector, x_I^d, contains the concatenation of the cx × cy cell histograms, i.e. x_I^d = {h_1, h_2, ..., h_{cx×cy}}, where h_i is the descriptor histogram of cell i.
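As a sketch of this grid-based feature extraction, the following example uses the uniform LBP implementation available in scikit-image as the local descriptor; the descriptor choice, the 3 × 3 grid and the per-cell normalization are assumptions made for the example, not necessarily the exact setup used in our experiments.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def grid_descriptor_histogram(gray, cx=3, cy=3, n_points=8, radius=1):
    """Concatenate per-cell histograms of uniform LBP codes over a cx x cy grid."""
    codes = local_binary_pattern(gray, n_points, radius, method="uniform")
    n_bins = n_points + 2                         # number of uniform LBP codes
    h, w = codes.shape
    feats = []
    for r in range(cy):
        for c in range(cx):
            cell = codes[r * h // cy:(r + 1) * h // cy,
                         c * w // cx:(c + 1) * w // cx]
            hist, _ = np.histogram(cell, bins=n_bins, range=(0, n_bins))
            feats.append(hist / max(hist.sum(), 1))   # per-cell normalization (an assumption)
    return np.concatenate(feats)                  # x_I^d with cx * cy * n_bins components
```

Per-descriptor vectors of this kind are the inputs of the SVM classifiers evaluated in Section 3.3.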
In this particular task, we have evaluated different descriptors and grid configurations. In this sense, we have considered the following 8 descriptors:

– Histogram of Oriented Gradients (HOG) [7].
– Local Binary Patterns (LBP) and uniform Local Binary Patterns (LBPu2) [1].
– Local Gradient Patterns (LGP) [12].
– Local Ternary Patterns (LTP) [17].
– Local Phase Quantization (LPQ) [18].
– Weber Local Descriptor (WLD) [5].
– Local Oriented Statistics Information Booster (LOSIB) [8].

3 Results

This section describes the results obtained for the different phases of the fish identification task, highlighting those configurations that were submitted to the 2015 Lab focused on this particular problem [14].

Fig. 1. From left to right, a training sample and two validation samples of Abudefduf vaigiensis. They are presented at a similar relative scale.

3.1 Datasets

Before granting access to the test data, the organizers provided two datasets, see Figure 1. Even though a more detailed description of the data may be found in [14], we summarize some relevant characteristics below. The first dataset, which we call the training dataset, is a collection of cropped images of the different fish species. The second collection contains annotated videos, including media that may present a scenario similar to the test data. We call this collection the validation dataset. The validation dataset is used in the following subsections to analyze the different detection and classification alternatives, providing a cue to decide the final system setup chosen for the Fish task submission. In fact, we used both the training and validation datasets to select the classification approach, and the validation dataset to select the detection technique and tune its parameters.

Briefly, the training set contains samples of the 15 different fish species, i.e. classes. The number of samples per species is indicated in Table 1. The reader will observe that the different species are not equally represented in the dataset, a circumstance that is also present in the validation and test sets. The average dimensions of the training samples are 88 ± 38 × 102 ± 49 pixels.

Table 1. Number of samples per class in the training and validation datasets.

  Fish species                  Training   Validation
  Abudefduf vaigiensis               305          132
  Acanthurus nigrofuscus            2511          294
  Amphiprion clarkii                2985          363
  Chaetodon lunulatus               2494         1217
  Chaetodon speculum                  24          138
  Chaetodon trifascialis             375          335
  Chromis chrysura                  3593          275
  Dascyllus aruanus                  904          894
  Dascyllus reticulatus             3196         3165
  Hemigymnus melapterus              147          214
  Myripristis kuntee                3004          242
  Neoglyphidodon nigroris            129           85
  Pempheris vanicolensis              49          999
  Plectroglyphidodon dickii         2456          737
  Zebrasoma scopas                   271           72
  Total                            22443         9162

The validation dataset contains 9162 samples, distributed per class according to the last column of Table 1. The average dimensions of those samples are 52 ± 37 × 56 ± 39 pixels.

3.2 Detection Results

As mentioned above, the annotated validation videos were used to analyze the performance of the different detection algorithms. The detection rates for the three implementations are shown in Table 2, computed as the total number of correct or positive detections divided by the number of annotations. The false detection rate presented is also a ratio, between the number of unmatched or false detections and the number of annotations. This was done to have clear evidence of the number of false detections in relation to the number of annotations. False detections do not necessarily mean a failure of the detection module, but rather that there is no annotation for that particular frame and ROI. Indeed, the annotations were done only when the fish species was clearly identifiable [14]. In this sense, we have made use of the minimal annotation size, applying a dimension filter to remove small detected ROIs.

A positive detection is considered when there is a significant intersection between a given detection container, B, and an annotation container, A. As confidence measure, we employed the Jaccard Index, JI. This index relates the intersection of both containers with their union, JI = |A ∩ B| / |A ∪ B|, providing a value between 0 and 1, with larger values meaning better matching. For the analysis summarized in Table 2, we have considered 0.4 and 0.5 threshold values.
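For reference, a minimal sketch of the Jaccard Index between two axis-aligned boxes follows; the (x, y, width, height) box convention and the function name are assumptions made for the example.

```python
def jaccard_index(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))   # width of the intersection
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))   # height of the intersection
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# A detection B matches an annotation A when jaccard_index(A, B) exceeds
# the chosen threshold (0.4 or 0.5 in Table 2).
```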
Table 2. Detection rates (and false detection rates) considering different detection techniques and JI thresholds.

  Detection     Detection rate (false detection rate)
  algorithm        JI = 0.5        JI = 0.4
  Fast I         0.80 (4.60)     0.85 (4.55)
  Fast II        0.74 (3.74)     0.80 (3.67)
  BackProj       0.82 (2.77)     0.88 (2.49)
  GMM                 -          0.40 (2.49)

The high variability of the video segments made it extremely difficult to obtain a good tuning of the algorithm parameters. As a consequence, simple approaches yielded better results, both in execution time and in detection. Indeed, among the techniques analyzed, both the Fast and BackProj algorithms provided acceptable, if not brilliant, detection rates. Fast was chosen with different tuning parameters to set up run1, while BackProj was used for the other two submitted runs (run2 and run3). Each detection approach is later combined with the classifier that provided the best performance on the validation dataset.

3.3 Classification Results

The detection rates achieved, described in the previous section, allowed us to explore our model based approach on the dataset. Certainly, a model based approach is not a priori the best solution for this unbalanced classification task, but being newcomers, we were interested in applying our experience in other CV problems to evaluate local descriptors in this scenario.

The analysis described in this section presents results in two steps. Firstly, the study evaluates the different descriptors on the training set, i.e. the collection of cropped images, see Table 1. Secondly, the best descriptors are later evaluated on the validation dataset, to adopt the most promising configuration for the test set.

Table 3 summarizes the results obtained by the different local descriptors in a 5-fold cross validation experiment defined on the provided training dataset, considering a single multiclass SVM based classifier. This kind of approach has already been applied to the task [2]. Each descriptor is evaluated for different grid configurations in the ranges cx ∈ [1, 4] and cy ∈ [1, 4]. Unfortunately, given the deadline (whose extension was not evident at the time), we could not manage to evaluate all the grid configurations with dimensions larger than 3 × 3.
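As an illustration of this evaluation protocol, a minimal sketch using scikit-learn follows; the RBF kernel, the C value and the feature scaling step are assumptions made for the example rather than the exact settings of our experiments.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def cv_accuracy(features, labels):
    """Mean accuracy of a single multiclass SVM in a 5-fold cross validation."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    scores = cross_val_score(clf, features, labels, cv=5, scoring="accuracy")
    return scores.mean()

# features: (n_samples, d) array of concatenated cell histograms for one descriptor
# labels:   (n_samples,) array with one of the 15 species identifiers per sample
```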
Table 3. Mean accuracy achieved in a 5-fold cross validation experiment on the training dataset.

                       Grid setup
  Descriptor      1×1     2×2     3×3     4×4
  HOG           54.69   87.71   95.66   97.50
  LBPu2         77.79   94.11   96.72   96.88
  LBP           88.20   91.62   85.98   65.58
  LGP           61.57   84.00   92.06   95.13
  LTP           89.90   86.18   34.91   26.60
  LPQ           90.74   95.70   88.73   53.87
  WLD           16.01   16.01   15.92   13.30
  LOSIB         40.32   73.13   87.63   92.28

With the exception of WLD, the whole collection of descriptors reported a high accuracy, at least for a particular grid setup. However, this was not the case in the following analysis on the annotated ROIs extracted from the validation dataset, as summarized in Table 4. It seems that the grid configuration is important for some descriptors. A larger number of cells is preferred for HOG, LBPu2, LGP and LOSIB, while other descriptors such as LBP, LTP and LPQ provide better results with a lower number of cells. Again, WLD does not provide useful classification results.

Table 4. Accuracy achieved with a single descriptor, training on the training dataset and testing on the annotated samples extracted from the validation dataset.

                       Grid setup
  Descriptor      1×1     2×2     3×3     4×4
  HOG           19.25   28.24   33.55   36.76
  LBPu2         19.09   25.92   33.08   31.18
  LBP           20.91   18.94   12.88    4.99
  LGP           14.85   22.36   28.17   30.84
  LTP           18.55   16.88    4.63    3.00
  LPQ           22.15   20.85   15.72    6.48
  WLD            3.00    3.00    3.00    3.96
  LOSIB         12.55   19.21   27.42   28.47

Considering the importance of combining several descriptors [19], a further evaluation of a fusion approach was carried out. Following the score level (SL) fusion literature and previous results in the context of facial processing [4, 9], we adopted a score level fusion approach where the first layer is composed of a set of classifiers designed according to the chosen descriptors, while the second layer classifier takes the first layer scores as input. In summary, the fusion alternatives analyzed below follow the approach outlined in Figure 2.

Fig. 2. Illustration of the two-stage classification fusion architecture, with n classifiers in the first stage whose scores are fed into a second-stage meta-classifier.

Table 5 summarizes the results achieved for the different fusion alternatives. The selection of descriptors and grids is based on the single descriptor results achieved for the validation dataset, see again Table 4. Table 5 does not include results with 4 × 4 grids, as they were not available in time for the deadline. The first three alternatives combine the best descriptors for a given grid resolution; the higher the resolution, the better the accuracy. However, there is no real restriction to make use of a unique grid resolution. For that reason we also evaluated the fusion of the best descriptors with different grid resolutions, achieving the best overall accuracy with an RBF kernel using the first 90 PCA components. For each combination, the selected descriptors and grid setups are indicated, reporting the results using SVM based classifiers, including linear and RBF kernels, with and without a previous dimensionality reduction by means of Principal Component Analysis (PCA). The best performing classifier, the fourth combination using an RBF kernel with PCA-based features, is the one used in combination with the selected detection approaches.

3.4 Discussion

As mentioned above, our team submitted three runs. They all made use of an identical second phase based on a two-stage classifier. The selected descriptor combination is the fourth one in Table 5. This score fusion selection contains six single descriptor classifiers in the first stage: LTP1×1, LPQ1×1, HOG3×3, LBPu2 3×3, LGP3×3 and LOSIB3×3. The second stage makes use of the classifier scores, which are projected into a PCA space. Each run differs in its detection phase: our first run made use of the Fast detection algorithm, while the other two integrate the BackProj detector with different parameter setups.
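As an illustration of this two-stage score level fusion, a minimal sketch using scikit-learn follows; the first stage kernel, the way scores are stacked and the reuse of training scores to fit the meta-classifier are simplifying assumptions made for the example, while the six per-descriptor inputs and the 90 PCA components mirror the configuration described above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def train_score_fusion(descriptor_feats, labels, n_components=90):
    """First stage: one multiclass SVM per descriptor; second stage: PCA + RBF SVM on the scores."""
    first_stage, scores = [], []
    for feats in descriptor_feats:                       # one (n_samples, d_k) array per descriptor
        clf = SVC(kernel="linear", decision_function_shape="ovr").fit(feats, labels)
        first_stage.append(clf)
        scores.append(clf.decision_function(feats))      # (n_samples, n_classes) scores
    stacked = np.hstack(scores)                          # 6 descriptors x 15 classes = 90 columns
    pca = PCA(n_components=n_components).fit(stacked)
    meta = SVC(kernel="rbf").fit(pca.transform(stacked), labels)
    return first_stage, pca, meta

def predict_score_fusion(first_stage, pca, meta, descriptor_feats):
    """Apply the two stages to new per-descriptor feature arrays."""
    scores = np.hstack([clf.decision_function(f)
                        for clf, f in zip(first_stage, descriptor_feats)])
    return meta.predict(pca.transform(scores))
```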
Table 5. Accuracy achieved with the fusion approaches, training on the cropped image collection (training dataset) and testing on the annotated samples extracted from the validation videos.

  Descriptor combination                                       SVM approach   Accuracy
  HOG1×1 + LBP1×1 + LBPu2 1×1 + LTP1×1 + LPQ1×1                RBF               24.56
                                                               RBF+PCA           25.45
                                                               Linear            27.42
                                                               Linear+PCA        27.21
  HOG2×2 + LBP2×2 + LBPu2 2×2 + LTP2×2 + LPQ2×2                RBF               20.44
                                                               RBF+PCA           29.54
                                                               Linear            21.06
                                                               Linear+PCA        29.69
  HOG3×3 + LBPu2 3×3 + LGP3×3 + LOSIB3×3                       RBF               35.63
                                                               RBF+PCA           38.65
                                                               Linear            36.96
                                                               Linear+PCA        38.75
  LTP1×1 + LPQ1×1 + HOG3×3 + LBPu2 3×3 + LGP3×3 + LOSIB3×3     RBF               34.99
                                                               RBF+PCA           40.41
                                                               Linear            36.43
                                                               Linear+PCA        39.74

The normalized counting scores of the submitted runs in the overall Lab analysis are reported in Figure 3. Two teams are over 50%, followed at a remarkable distance by the best runs of two other teams, including our run3, which achieve over 30%. Our main focus was on the classification phase, which has provided unbalanced results for the different classes, likely due to the non-homogeneous number of training samples per class. An approach based exclusively on the fusion of local descriptors seems not to be reliable enough for the problem. However, the detection phase also requires further attention, as a larger number of proper detections would improve the overall score.

Fig. 3. Normalized counting score of the participant runs. Extracted from [14].

4 Conclusions

This document describes the model based approach submitted to the LifeCLEF 2015 Fish task by the SIANI team. The proposal explores the use of local descriptors for this problem. We employed standard detection techniques to later apply an ensemble of SVM multiclass classifiers. Three runs were submitted with an identical classification stage. One is based on the Fast detection algorithm, while the other two are based on the BackProj algorithm. The best accuracy achieved for the ideal annotated containers reaches 40%, suggesting that the approach is still far from being reliable in this scenario. In the near future, our aim is, on the one hand, to improve detection, which might be combined with tracking. On the other hand, having observed the problems originated by the multiclass classification of an unbalanced dataset, and apart from computing denser grids, we should explore the combination with other techniques to strengthen the classification stage.

Acknowledgments. Work partially funded by the Institute of Intelligent Systems and Numerical Applications in Engineering and the Computer Science Department at ULPGC.

References

1. Ahonen, T., Hadid, A., Pietikäinen, M.: Face description with local binary patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(12) (December 2006)
2. Blanc, K., Lingrand, D., Precioso, F.: Fish species recognition from video using SVM classifier. In: CLEF (Working Notes). pp. 778–784 (2014)
3. Cappellato, L., Ferro, N., Jones, G., San Juan, E. (eds.): CLEF 2015 Labs and Workshops, Notebook Papers. CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org/Vol-1391/ (2015)
4. Castrillón, M., Lorenzo, J., Ramón, E.: Improving gender classification accuracy in the wild. In: 18th Iberoamerican Congress on Pattern Recognition (CIARP) (2013)
5. Chen, J., Shan, S., He, C., Zhao, G., Pietikäinen, M., Chen, X., Gao, W.: WLD: A robust local image descriptor. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), 1705–1720 (September 2010)
6. Csurka, G., Dance, C.R., Fan, L., Willamowski, J., Bray, C.: Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV. pp. 1–22 (2004)
7. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Schmid, C., Soatto, S., Tomasi, C. (eds.) International Conference on Computer Vision & Pattern Recognition. vol. 2, pp. 886–893 (June 2005)
8. García-Olalla, O., Alegre, E., Fernández-Robles, L., González-Castro, V.: Local oriented statistics information booster (LOSIB) for texture classification. In: International Conference on Pattern Recognition (ICPR) (2014)
9. Heisele, B., Serre, T., Poggio, T.: A component-based framework for face detection and identification. International Journal of Computer Vision 74(2) (August 2007)
10. Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Planqué, R., Rauber, A., Fisher, R., Müller, H.: LifeCLEF 2014: Multimedia life species identification challenges. In: Information Access Evaluation. Multilinguality, Multimodality, and Interaction, Lecture Notes in Computer Science, vol. 8685, pp. 229–249. Springer (2014)
11. Joly, A., Müller, H., Goëau, H., Glotin, H., Spampinato, C., Rauber, A., Bonnet, P., Vellinga, W.P., Fisher, B.: LifeCLEF 2015: Multimedia life species identification challenges. In: Proceedings of CLEF 2015 (2015)
12. Jun, B., Kim, D.: Robust face detection using local gradient patterns and evidence accumulation. Pattern Recognition 45(9), 3304–3316 (2012)
13. Lo, B., Velastin, S.: Automatic congestion detection system for underground platforms. In: Proceedings of the 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing. pp. 158–161 (2001)
14. Spampinato, C., Fisher, B., Boom, B.: LifeCLEF fish identification task 2015. In: CLEF Working Notes 2015 (2015)
15. Stauffer, C., Grimson, W.E.L.: Adaptive background mixture models for real-time tracking. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 246–252 (1999)
16. Swain, M.J., Ballard, D.H.: Color indexing. International Journal of Computer Vision 7(1), 11–32 (1991), http://www.springerlink.com/content/n231l41541p12l1g/
17. Tan, X., Triggs, B.: Enhanced local texture feature sets for face recognition under difficult lighting conditions. IEEE Transactions on Image Processing 19(6), 1635–1650 (2010)
18. Ojansivu, V., Heikkilä, J.: Blur insensitive texture classification using local phase quantization. In: Proc. Image and Signal Processing (ICISP). pp. 236–243 (2008)
19. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision 73(2), 213–238 (2007)
20. Zivkovic, Z., van der Heijden, F.: Efficient adaptive density estimation per image pixel for the task of background subtraction. Pattern Recognition Letters 27, 773–780 (2006)