=Paper=
{{Paper
|id=Vol-1307/paper6
|storemode=property
|title=An Application of Shape-Based Level Sets to Fish Detection in Underwater Images
|pdfUrl=https://ceur-ws.org/Vol-1307/paper6.pdf
|volume=Vol-1307
|dblpUrl=https://dblp.org/rec/conf/gsr/RavanbakhshSSMH14
}}
==An Application of Shape-Based Level Sets to Fish Detection in Underwater Images==
GSR_3 Geospatial Science Research 3. School of Mathematical and Geospatial Science, RMIT University, December 2014

Mehdi Ravanbakhsh (mehdi.r@rmit.edu.au), Mark R. Shortis (mark.shortis@rmit.edu.au): RMIT University, GPO Box 2476, Melbourne, VIC 3001 Australia

Faisal Shafait (faisal.shafait@uwa.edu.au), Ajmal Mian (ajmal.mian@uwa.edu.au): The University of Western Australia, 35 Stirling Hwy, Crawley, WA 6009 Australia

Euan S. Harvey (euan.harvey@curtin.edu.au): Curtin University, GPO Box U1987, Perth, WA 6845 Australia

James W. Seager (jseager@seagis.com.au): SeaGIS P/L, PO Box 1085, Bacchus Marsh, VIC 3340 Australia

===Abstract===
Underwater stereo-video systems are widely used for the measurement of fish. However, the effectiveness of stereo-video measurement has been limited because most operational systems still rely on a human operator. In this paper, an automated approach for fish detection using a shape-based level sets framework is presented. Shape knowledge of fish is modelled by Principal Component Analysis (PCA). The Haar classifier is used to locate the fish head and snout precisely in the image, which is vital information for initialising the shape model in close proximity to the fish. The approach has been tested on underwater images representing a variety of challenging situations typical of the underwater environment, such as background interference and poor contrast boundaries. The results obtained demonstrate that the approach is capable of overcoming these limitations and capturing the fish outline at sub-pixel accuracy.
Keywords: image segmentation, fish detection, underwater image, level sets, prior shape knowledge, registration

===Introduction===
The monitoring of fish for stock assessment in aquaculture and commercial fisheries, and for assessing the effectiveness of biodiversity management strategies such as Marine Protected Areas and closed area management, is essential for the economic and environmental management of fish populations. Video-based techniques for fishery-independent and non-destructive sampling are now widely accepted. The advantages of using stereo-video for counting the numbers of fish, measuring their lengths and defining the sample area have been well demonstrated (Shortis et al., 2009). However, the effectiveness of stereo-video measurement has been limited because most operational systems still rely on a human operator to identify and measure the snout and tail of the fish in order to determine the length by intersection.

Whilst automated object identification and image measurement have been demonstrated in many other contexts, an automated solution for fish sizing has been elusive because of the uncontrolled underwater environment combined with the loss of contrast caused by attenuation through the water. Although automation of some aspects of the process has been established for at least 15 years (Lines et al., 2001), only recently have fully operational systems that identify, delineate, track and measure fish in an uncontrolled environment been reported (Shortis et al., 2013).

The ultimate aim of this research is to develop a general approach to the automatic measurement of fish in underwater environments. The focus of this work is on the identification and delineation of Southern Bluefin Tuna (SBT). In the context of this research, automated detection methodologies comprise two steps: identification and subsequent delineation of the fish outline.
The existing literature on fish detection has mainly focused on the identification step, where the presence of fish in the scene is recognised and the fish location is then estimated (Palazzo et al., 2013; Spampinato et al., 2008; Walther et al., 2004; Zhou and Clark, 2006; Morais et al., 2005; Evans, 2003). In contrast, relatively few approaches have been reported that deal with both identification and the subsequent delineation of the fish silhouette (Khanfar et al., 2010; Lines et al., 2001; Hariharakrishnan and Schonfeld, 2005). Most of these approaches use low-level image features such as colour, texture, intensity and motion to detect fish. However, in real-life conditions the uncontrolled underwater environment produces images that are characterised by low contrast, background clutter and interference, partial occlusion caused by adjacent or foreground objects, varied illumination conditions and shadows. The aforementioned approaches fail to produce high-quality results, mainly because the low-level features are misleading under image noise and occlusion, or because too few low-level features are available for object modelling. High-level knowledge of the shape of the fish can significantly aid in providing an efficient solution to these problems.

In this paper, an automated approach for fish detection using a shape-based level sets framework is presented. An example of the underwater stereo images used is shown in Figure 1. The prior knowledge of the shape of the fish is modelled using Principal Component Analysis (PCA) (Leventon et al., 2000) and this knowledge is used to guide the level set curves. PCA enables the representation of the global shape variation of the object of interest through a training set of shape templates. The global shape information is incorporated into the Mumford-Shah functional, as reported by Chan and Vese (2001), which can detect objects in strongly cluttered scenes.
A Haar-like detector (Lienhart and Maydt, 2002) is used to identify the presence of fish and determine their locations in the image. This information is vital for placing the initial shape in close proximity to the object to be segmented, which increases the success rate and requires fewer iterations for convergence. Once the fish have been independently identified in the left and right images, stereo intersections for the snout and tail are computed using the well-established approach of a geometrically constrained epipolar search and template match between the two images.

Figure 1: Typical stereo image pair captured during a transfer from the purse-seine net to the grow-out cage. The water surface is to the right of the images and the apparent vertical orientation of the fish is caused by the mounting of the stereo-video system on the side of the transfer gate.

The outline of the paper is as follows. In the following section, a short review of level sets is given. The subsequent section describes the individual steps of the proposed detection strategy along with the mathematical equations. Then, experimental results using underwater sample image sequences recorded in cages are presented and evaluated. The paper concludes with a discussion of the progress and results achieved, and an outlook for future work.

===Level Set Representation===
The core idea of level sets is to implicitly represent a contour C as the zero level curve of a function φ of higher dimension (Figures 2-a and 2-b). An initialisation of φ can be constructed in the following way: let C be a closed curve representing the boundary between two regions, one region inside the curve and another region outside the curve. φ is then defined as the signed distance ±d(x) to the curve, negative inside and positive outside:

<math>\phi(x) = \begin{cases} -d(x) & \text{if } x \text{ is inside } C \\ 0 & \text{if } x \in C \\ +d(x) & \text{if } x \text{ is outside } C \end{cases} \qquad (1)</math>

Figure 2: Illustrating level sets.
(a) The curve C (red) is used to construct the level set function φ such that φ is negative inside and positive outside the curve. Distance values d are grey-value coded. (b) A plane at zero level (Z=0) intersects the level set function φ, and thus the zero level curve C is obtained.

While the use of the distance d(x) is not mandatory when using level sets, it ensures that φ does not become too flat or too steep near C and can therefore be differentiated across the zero level curve without running into numerical problems. In order to combine the characteristics of the level set function, image information and shape knowledge of the known object, an energy functional can be set up and then minimised using the calculus of variations. The energy functional is minimised in an iterative process that moves the initial curve towards the object boundaries.

===Detection Strategy===
The fish detection strategy comprises three primary steps (Figure 3). First, the presence of fish is recognised and the initial locations are determined using segmentation of a frame difference from an averaged background image. A Haar-like detector is then employed to estimate the snout and tail locations, from which the initial position and orientation of each fish in the image can be derived. Subsequently, a shape prior model is constructed by PCA using a set of training samples. The level set curve is then initialised and evolved to locate the fish boundary. The result consists of the detected fish.

Figure 3: Workflow of fish detection

====Identification====
In this stage, the locations of the fish snout and tail in the image are determined. Precise localisation of the snout and tail leads to the estimation of pose parameters in 2D space, these being two rotations, two translations and one scale parameter. In this research, the Haar classifier is used to locate the fish snout and tail.
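As an illustration of how snout and tail locations can yield pose parameters, the following minimal sketch derives a 2D similarity transform from the two points. It assumes a single rotation angle, two translations and one scale, and a hypothetical unit-length reference shape with its tail at the origin and its snout at (1, 0); the function names are illustrative and are not taken from the paper.

```python
import math

def pose_from_snout_tail(snout, tail, ref_snout=(1.0, 0.0), ref_tail=(0.0, 0.0)):
    """Return (scale, angle, tx, ty) mapping the reference snout/tail
    onto the detected snout/tail (assumed 2D similarity transform)."""
    dx, dy = snout[0] - tail[0], snout[1] - tail[1]
    rx, ry = ref_snout[0] - ref_tail[0], ref_snout[1] - ref_tail[1]
    scale = math.hypot(dx, dy) / math.hypot(rx, ry)
    angle = math.atan2(dy, dx) - math.atan2(ry, rx)
    c, s = math.cos(angle), math.sin(angle)
    # Translation chosen so that the transformed reference tail lands on the detected tail.
    tx = tail[0] - scale * (c * ref_tail[0] - s * ref_tail[1])
    ty = tail[1] - scale * (s * ref_tail[0] + c * ref_tail[1])
    return scale, angle, tx, ty

def apply_pose(point, pose):
    """Apply scale, rotation and translation to a single (x, y) point."""
    scale, angle, tx, ty = pose
    c, s = math.cos(angle), math.sin(angle)
    x, y = point
    return (scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)

# Hypothetical detections in pixel coordinates: a fish oriented vertically in the image.
pose = pose_from_snout_tail(snout=(120.0, 80.0), tail=(120.0, 280.0))
mapped_snout = apply_pose((1.0, 0.0), pose)
```

Applying the recovered pose to the reference snout reproduces the detected snout, which is the property needed to place the reference shape close to the fish before curve evolution.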
To train the classifier, 200 manually cropped images of the target object (snout or tail) are used so that the classifier can learn which features (among a set of possibly thousands of features) can locate the target with high accuracy. These features, once learned, are then used to construct the object classifier that can detect the presence of the object in cluttered scenes. Due to their high detection speed and ability to perform a scale-space search, Haar classifiers are employed in this research for locating the snout and tail of fish in underwater image sequences. The results of independent detection of the snout and tail using Haar detectors are further improved by using the expected distance and angle relationships between the detected snouts and tails. The search space for tail detection is based on the results of the snout detection and vice versa. Figure 4 shows an example of the result from the Haar classifier used to identify the snouts and tails of Southern Bluefin Tuna (SBT) during a transfer. The precisely localised tip of the snout and valley point of the tail serve as reference points for estimating the rigid transformation parameters. These transformation parameters are then used first to generate the reference shape and subsequently to initialise the shape model, two crucial steps in accurate and correct delineation of fish.

Figure 4: Identification of SBT snouts and tails, marked by circles, using the Haar classifier.

====Shape Prior Generation====
The generation of the initial shape, also called the shape prior, comprises two steps: first, the training samples need to be geometrically aligned, and subsequently, the shape model is constructed from the aligned shapes. The alignment involves matching shapes of training samples that differ in size, orientation and translation. In the literature, a large number of shape matching methods have been reported. A complete review of those methods is given in Veltkamp and Hagedoorn (1999).
In this paper, the alignment of training samples is realised using the method introduced in Chen et al. (2002). Suppose that the training set contains n given curves C<sub>1</sub>, ..., C<sub>n</sub> with their corresponding interior regions A<sub>1</sub>, ..., A<sub>n</sub>. The shape similarity measure of the shapes C<sub>1</sub> and C<sub>2</sub> is defined as:

<math>a(C_1, C_2) = \text{area of } (A_1 \cup A_2 - A_1 \cap A_2) \qquad (2)</math>

In the alignment process, the pose parameters of C<sub>1</sub> are considered to be fixed, and the rest of the samples (C<sub>2</sub>, ..., C<sub>n</sub>) are jointly aligned to C<sub>1</sub> through the solution of the rigid transformation C<sub>j</sub><sup>new</sup> = s<sub>j</sub> R<sub>j</sub> C<sub>j</sub> + T<sub>j</sub> (j = 2, ..., n) such that the area a(C<sub>1</sub>, C<sub>j</sub><sup>new</sup>) is minimised. The transformation parameters are obtained with a genetic algorithm (Davis, 1991), a global optimisation method that is less likely to become trapped in a suboptimal local minimum than purely local methods such as gradient descent. The shapes are encoded in binary images to simplify the alignment task.

Figure 5 shows a set of 20 manually digitised training samples and the result of their alignment. The first sample (bottom row, left-most), which is the scaled, shifted and rotated version of the corresponding manually digitised sample (top row, left-most), is adopted as the reference. It has fixed pose parameters estimated in the identification process, and the rest of the samples are registered to it. Figures 6-a and 6-b show the amount of shape variability depicted in the overlap images before and after the alignment. It can be seen that large shape discrepancies often exist in real fish images. These shape differences can be removed successfully, which demonstrates the effectiveness of the alignment method. Furthermore, model variability is represented in Figure 6-d, showing that the areas around the boundaries of the fish fin and tail experience the largest deformations in the fish body outline.
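The area-based measure of Equation (2) can be sketched on binary masks as follows. This is a minimal illustration only: the masks are small nested lists, and an exhaustive search over integer translations stands in for the genetic algorithm over scale, rotation and translation used in the paper. All function names are hypothetical.

```python
def symmetric_difference_area(mask_a, mask_b):
    """Pixel count of (A union B) minus (A intersection B) for two equally sized binary masks."""
    return sum(a != b
               for row_a, row_b in zip(mask_a, mask_b)
               for a, b in zip(row_a, row_b))

def shift(mask, dx, dy):
    """Translate a binary mask by (dx, dy) pixels; pixels leaving the grid are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = 1
    return out

def best_translation(reference, sample, max_shift=3):
    """Exhaustive search (a stand-in for the genetic algorithm) minimising Eq. (2)."""
    candidates = [(dx, dy)
                  for dx in range(-max_shift, max_shift + 1)
                  for dy in range(-max_shift, max_shift + 1)]
    return min(candidates,
               key=lambda t: symmetric_difference_area(reference, shift(sample, *t)))

def rectangle(size, top, left, height, width):
    """Binary mask of an axis-aligned rectangle on a size x size grid."""
    return [[1 if top <= y < top + height and left <= x < left + width else 0
             for x in range(size)] for y in range(size)]

reference = rectangle(8, top=2, left=1, height=2, width=3)
sample = rectangle(8, top=1, left=3, height=2, width=3)
before = symmetric_difference_area(reference, sample)
dx, dy = best_translation(reference, sample)
after = symmetric_difference_area(reference, shift(sample, dx, dy))
```

The measure drops to zero once the sample coincides with the reference, which is the behaviour the alignment step relies on.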
It is interesting to note that key regions that could be used for species identification, such as the dorsal and anal fins and the tail, are the profile sections which show the greatest variability.

Figure 5: The top row shows the binary representation of training samples of fish shapes. The bottom row presents the training samples after geometric alignment.

Figure 6: (a) Overlaid training samples with varying degrees of overlap before alignment; (b) Aligned samples; (c) Average of aligned shapes; (d) Model variability, grey-value coded, with white and black representing the highest and lowest variability respectively.

In the next step, a shape model is constructed using the aligned shapes. The PCA method is selected to construct the shape model due to its efficiency at capturing the main variations of a training set while removing redundant information. Similar to Leventon et al. (2000), the boundaries of each of the training shapes are represented in the training dataset as the zero level set of n Signed Distance Functions (SDFs) {ϕ<sub>1</sub>, ..., ϕ<sub>n</sub>}, with negative distances assigned to the inside and positive distances assigned to the outside of the shape boundary. Suppose M is a matrix whose column vectors are the n aligned training SDFs {ϕ<sub>i</sub>}. PCA is then applied to these SDFs to compute the eigenvalues and eigenvectors of the covariance matrix (3) and the mean level set function of the training set

<math>\bar{\Phi} = \frac{1}{n} \sum_{i=1}^{n} \phi_i \qquad (4)</math>

The eigenvectors are called principal components or eigenshapes. In practice, the first k principal components (k ≤ n) are sufficient to model the major shape variations in the training samples. In Mika et al. (1999), a method is proposed for determining the value of k by examining the eigenvalues of the corresponding eigenvectors. This approach, however, cannot be adopted here as the value of k varies in different applications (Tsai et al., 2003). In this work, the value of k was set empirically.
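The construction of the mean level set function and a leading eigenshape can be sketched as below. This is a toy illustration under stated assumptions: the training shapes are synthetic circles rather than fish outlines, the SDFs are mean-subtracted before the decomposition, and a power iteration on the small n x n Gram matrix replaces a full eigendecomposition. None of the function names come from the paper.

```python
import math

def circle_sdf(size, radius):
    """Flattened SDF of a centred circle: negative inside, positive outside."""
    centre = (size - 1) / 2.0
    return [math.hypot(x - centre, y - centre) - radius
            for y in range(size) for x in range(size)]

def mean_level_set(sdfs):
    """Pixel-wise mean of the training SDFs."""
    n = len(sdfs)
    return [sum(values) / n for values in zip(*sdfs)]

def leading_eigenshape(sdfs, iterations=100):
    """Mean, unit-norm leading eigenshape and its eigenvalue via power iteration."""
    n = len(sdfs)
    mean = mean_level_set(sdfs)
    offsets = [[v - m for v, m in zip(sdf, mean)] for sdf in sdfs]
    # n x n Gram matrix; it shares its non-zero eigenvalues with the pixel-space covariance.
    gram = [[sum(a * b for a, b in zip(oi, oj)) / n for oj in offsets]
            for oi in offsets]
    vec = [float(i + 1) for i in range(n)]  # arbitrary non-zero start vector
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        vec = [p / norm for p in product]
        eigenvalue = norm
    # Map the Gram eigenvector back to pixel space and normalise it.
    shape = [sum(v * o[j] for v, o in zip(vec, offsets)) for j in range(len(mean))]
    norm = math.sqrt(sum(s * s for s in shape)) or 1.0
    return mean, [s / norm for s in shape], eigenvalue

def synthesise(mean, eigenshapes, weights):
    """Level set function formed as the mean plus weighted eigenshapes."""
    return [m + sum(w * e[j] for w, e in zip(weights, eigenshapes))
            for j, m in enumerate(mean)]

training = [circle_sdf(9, r) for r in (3.0, 4.0, 5.0)]
mean, eigenshape, eigenvalue = leading_eigenshape(training)
model = synthesise(mean, [eigenshape], [9.0])
```

For these circles the only variation is the radius, so the single eigenshape is a constant offset and a weight of 9 changes the radius of the zero level set by one pixel.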
Then, the shape can be represented as the zero level set of the following function

<math>\Phi[w] = \bar{\Phi} + \sum_{i=1}^{k} w_i \Phi_i \qquad (5)</math>

where <math>\bar{\Phi}</math> is the mean level set function, <math>\Phi_i</math> are the eigenshapes, and w = {w<sub>1</sub>, ..., w<sub>k</sub>} denotes the weights for the k eigenshapes, with the variances of these weights {σ<sub>1</sub><sup>2</sup>, ..., σ<sub>k</sub><sup>2</sup>} given by the eigenvalues. In Equation (5), the shape variability is restricted to the variability given by the eigenshapes. To accommodate a wider range of shape variability, the pose parameters p, these being translation, scale and orientation, are incorporated into the level set function of (5). With the addition of p, the implicit description of shape is given by the zero level set of the following function

<math>\Phi[w, p] = \bar{\Phi}(p) + \sum_{i=1}^{k} w_i \Phi_i(p) \qquad (6)</math>

where the mean level set function and each eigenshape are now functions of p. Once the shape model is generated, an initial level set function is constructed using a rectangular curve around the detected fish. Then, the zero level set of the level set function is evolved towards the fish boundary according to the energy functional. The energy functional is described in the following section.

====Shape-Based Level Sets Energy Functional====
The energy functional is based on the segmentation model proposed by Chan and Vese (2001) in an effort to overcome limitations found with the previous edge-based strategies. Edge-based methods need an initialisation close to the object of interest and good contrast boundaries in order to locate those boundaries. In contrast, the region-based methods used in this work are independent of image gradients and less likely to converge to local minima if an undesirable feature or image noise is present. Let I be a given image and C the evolving curve defined as C = {(x, y) ∈ R<sup>2</sup>: Φ(x, y) = 0}, with u and v denoting two constants representing the averages of I inside and outside the curve C. Assume that the image I is formed by two regions of approximately piecewise-constant intensities with distinct values of I0i and I0o, and that the object to be detected is represented by the region with value I0i and boundary C.
Then, I0 ≈ I0i inside the object (inside C) and I0 ≈ I0o outside the object (outside C). By minimising the following energy, the boundary of the object of interest C is obtained (Chan and Vese, 2001)

<math>E = \int_{R^u} (I - u)^2 \, dA + \int_{R^v} (I - v)^2 \, dA \qquad (7)</math>

where <math>R^u</math> and <math>R^v</math> denote the regions inside and outside C. This is equivalent to the energy functional below (Tsai et al., 2001)

<math>E_{cv} = -\left(u^2 A_u + v^2 A_v\right) = -\left(\frac{S_u^2}{A_u} + \frac{S_v^2}{A_v}\right) \qquad (8)</math>

where A<sub>u</sub> and A<sub>v</sub> denote the areas, and S<sub>u</sub> and S<sub>v</sub> the sums of intensities, of the regions inside and outside C. Gradient descent is then employed to search for the parameters w and p that minimise E<sub>cv</sub> and thereby implicitly determine the segmenting curve C. The parameters A<sub>u</sub>, A<sub>v</sub>, S<sub>u</sub> and S<sub>v</sub> can be expressed in terms of Φ:

<math>A_u = \int_{\Omega} H(-\Phi) \, dA, \quad A_v = \int_{\Omega} H(\Phi) \, dA \qquad (9)</math>

and

<math>S_u = \int_{\Omega} I \, H(-\Phi) \, dA, \quad S_v = \int_{\Omega} I \, H(\Phi) \, dA \qquad (10)</math>

where Ω defines a bounded and open subset of R<sup>2</sup> and H denotes the Heaviside function

<math>H(\Phi) = \begin{cases} 1 & \text{if } \Phi \ge 0 \\ 0 & \text{if } \Phi < 0 \end{cases} \qquad (11)</math>

The energy function (8) is minimised with respect to w and p using gradient descent optimisation (12) (13) where the gradient parameters are given as (14) (15) (16) (17) where the segmenting curve C is given by the zero level set of Φ, and the gradient of Φ is taken with respect to the i-th component of the transformation parameters p, which include translation, rotation and scale. The gradient descent optimisation of Equations (12) and (13) leads to the parameters w and p. The updated w and p parameters, which are iteratively computed during the optimisation, are then used to implicitly determine the location of the segmenting curve C. The curve evolution is terminated when the overall change in the evolving curve positions per iteration is less than 0.1 pixels. A smaller threshold considerably increases the computational cost without improving the quality of the final result.

===Experimental Evaluation===
Underwater image sequences recorded at the transfer gate between two cages have been used to test the fish detection algorithm. From the large number of video samples recorded for 8 transfers, 35 sample images have been chosen to represent the variable and uncontrolled nature of the marine environment.
These images include a varying number of SBT with a range of illumination changes, background interference and occlusions caused by adjacent fish. Moreover, SBT appear in the image sequences with missing or poor contrast boundaries, which further exacerbates the challenging conditions. In Figure 7, an example of results is shown where the initial curve is placed as a rectangle around the fish of interest and subsequently converges to the fish boundary by minimising the energy functional presented in the previous section. Further example results are shown in Figure 7 where, in the four right-most samples, SBT are partially occluded by other neighbouring fish in the foreground and background. In almost all samples, fish boundaries are of low contrast, especially in areas around the tail and fin. The detection results shown in Figure 7 demonstrate that the approach is capable of overcoming those limitations typical of the underwater environment and capturing the fish outline accurately.

Figure 7: Fish detection result. (a) Initial curve; (b), (c) and (d) show the intermediate curves and (e) represents the final detection result. (f), (g), (h), (i) and (j) show the detection results of different fish in the presence of a range of background interference and foreground occlusions by other fish (two rightmost samples). n denotes the number of iterations in the intermediate and the final results: (b) n=3, (c) n=10, (d) n=13, (e) n=54, (f) n=32, (g) n=126, (h) n=154, (i) n=71, (j) n=91.

In order to quantitatively evaluate the performance of the approach, the detection results were compared to manually plotted fish outlines used as reference data. The comparison was carried out by matching the detection results to the reference data using the so-called buffer method (Heipke et al., 1998). A detected object is assumed to be correct if the maximum distance between the detected object and its corresponding reference does not exceed the buffer width.
Furthermore, a reference object is assumed to be matched if the maximum deviation from the detected object is within the buffer width. Based on these assumptions the following quality measures were used in our work:
* Completeness: the ratio of the number of matched reference objects to the total number of reference objects.
* Correctness: the ratio of the number of correctly detected objects to the number of detected objects.
* Geometric accuracy: the average distance between the correctly detected objects and their corresponding references, expressed as a root mean square (RMS) value.

Table 1 shows the evaluation result of the fish detection. The buffer width can be defined according to the required detection accuracy for a specific application. In our tests, the buffer was set to 3, 5 and 8 pixels according to the range of accuracy achievable at the identification step. Furthermore, this selection allows assessment of the relevance of the approach for applications that demand varying degrees of accuracy. As the buffer width increases from 3 to 8 pixels, both completeness and correctness increase, because more results are accepted at larger buffer widths. The geometric accuracy degrades slightly as the buffer width increases, from an RMS of 0.7 pixels at a buffer width of 3 pixels to 0.9 pixels at 8 pixels.

{| class="wikitable"
|+ Table 1: Evaluation results for fish detection applied to 35 samples
! Buffer width (pixels) !! Correctness (%) !! Completeness (%) !! Geometric accuracy (RMS, pixels)
|-
| 3 || 89.6 || 91.4 || 0.7
|-
| 5 || 94.3 || 94.3 || 0.8
|-
| 8 || 100 || 100 || 0.9
|}

The results are encouraging: sub-pixel geometric accuracy has been achieved in all experiments with high rates of completeness and correctness. However, severe deformation around the fins and the tail of the fish cannot be absorbed by the current approach.
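The three quality measures can be sketched as follows. This is a minimal illustration that assumes the per-point distances between detected and reference outlines have already been computed; the input layout, the function name and the distance values are hypothetical and are not data from the experiments.

```python
import math

def evaluate(det_to_ref, ref_to_det, buffer_px):
    """Return (completeness, correctness, rms) for a given buffer width.

    det_to_ref: one list per detected object, holding distances (pixels) from
                its outline points to the corresponding reference outline.
    ref_to_det: one list per reference object, holding distances from its
                outline points to the detected outline (empty if undetected).
    """
    correct = [d for d in det_to_ref if d and max(d) <= buffer_px]
    matched = [r for r in ref_to_det if r and max(r) <= buffer_px]
    correctness = len(correct) / len(det_to_ref) if det_to_ref else 0.0
    completeness = len(matched) / len(ref_to_det) if ref_to_det else 0.0
    # RMS distance is computed over correctly detected objects only.
    points = [x for d in correct for x in d]
    rms = math.sqrt(sum(x * x for x in points) / len(points)) if points else float("nan")
    return completeness, correctness, rms

# Hypothetical distances for three fish.
det_to_ref = [[0.5, 1.0, 0.5], [2.0, 4.0, 1.0], [0.2, 0.4, 0.4]]
ref_to_det = [[0.6, 0.9], [1.5, 4.5], [0.3, 0.3]]
narrow = evaluate(det_to_ref, ref_to_det, buffer_px=3)
wide = evaluate(det_to_ref, ref_to_det, buffer_px=5)
```

With the narrower buffer the second fish is rejected, which lowers completeness and correctness; because only accepted objects enter the RMS, the narrow-buffer RMS is also the smaller of the two.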
The table nevertheless shows that the developed approach is in principle capable of extracting fish accurately under occlusion and within variable underwater environments. Accurate extraction of the shape is important for fish biomass estimation, length measurement and species recognition (Shortis et al., 2013). In each case an accuracy of one pixel would be sufficient to establish the initial conditions, so even the least favourable accuracy result in Table 1 would still be acceptable and simultaneously provide a high level of correctness and completeness.

===Conclusion and Outlook===
In this paper, an automated approach for the detection of fish from underwater images has been proposed, developed and tested. It comprises a region-based level set method that enables the delineation of the fish outline. The shape information of fish is incorporated into the level sets formulation through the PCA method to overcome such limitations as poor contrast boundaries, background clutter and occlusions caused by neighbouring fish. To provide a close initialisation for the shape model, the pose of fish in the image is determined using the Haar classifier. The developed approach has been applied to 35 samples of varying quality and occlusion level, and a quantitative evaluation of the results has been presented using three buffer width values.

The presented results show that level sets can be used to delineate fish outlines from underwater images if the shape information of the fish species is incorporated into the level sets energy functional. Furthermore, it was found that an energy function that is independent of image gradients and includes the shape model is able to overcome various kinds of disturbances and the problems related to low quality images recorded in the underwater environment, such as poor contrast and uneven illumination. The current approach has been developed to detect SBT in an aquaculture environment.
The techniques developed here have clear potential to be extended to wild habitats provided that the perspective deformation of the fish body and movement information derived from image sequences are taken into account. In wild habitats, fish can move in any direction with large deformations occurring in the image of the body, causing this fish detection approach to break down. For the technique to be successful in wild habitats, varying rates of deformation and fish orientation need to be modelled. The detection of different fish species in addition to SBT is another goal that will be pursued in future research, as in reef and other underwater habitats many fish species are present. Furthermore, investigation into the possibility of using colour information in the level sets formulation will be carried out.

===References===
* Chan, T.F. and Vese, L.A., 2001. Active contours without edges. IEEE Transactions on Image Processing, 10(2): 266–277.
* Chen, Y., Tagare, H., Thiruvenkadam, S., Huang, F., Wilson, D., Gopinath, K., Briggs, R. and Geiser, E., 2002. Using prior shapes in geometric active contours in a variational framework. International Journal of Computer Vision, 50(3): 315–328.
* Davis, L., 1991. Handbook of Genetic Algorithms. Van Nostrand: 100 pages.
* Evans, F., 2003. Detecting fish in underwater video using the EM algorithm. Proceedings of the 2003 IEEE International Conference on Image Processing, 3: III-1029–1032.
* Hariharakrishnan, K. and Schonfeld, D., 2005. Fast object tracking using adaptive block matching. IEEE Transactions on Multimedia, 7(5): 853–859.
* Heipke, C., Mayer, H., Wiedemann, C. and Jamet, O., 1998. External evaluation of automatically extracted road axes. Photogrammetrie, Fernerkundung, Geoinformation, 2: 81–94.
* Khanfar, H., Charalampidis, D., Ioup, G., Ioup, J. and Thompson, C.H., 2010. Automated recognition and tracking of fish in underwater video. Final Report, LA Board of Regents Contract NASA(2008)-STENNIS-08: 40 pages.
* Leventon, M., Grimson, W. and Faugeras, O., 2000. Statistical shape influence in geodesic active contours. IEEE International Conference of Computer Vision and Pattern Recognition, 1: 316–323.
* Lienhart, R. and Maydt, J., 2002. An extended set of Haar-like features for rapid object detection. Proceedings, IEEE International Conference on Image Processing, 1: 900–903. doi: 10.1109/ICIP.2002.1038171
* Lines, J.A., Tillett, R.D., Ross, L.G., Chan, D., Hockaday, S. and McFarlane, N.J.B., 2001. An automated image-based system for estimating the mass of free-swimming fish. Journal of Computers and Electronics in Agriculture, 31(2): 151–168.
* McInerney, T. and Terzopoulos, D., 1995. Topologically adaptable snakes. Proceedings of the Fifth IEEE International Conference on Computer Vision: 840–845.
* Mika, S., Schölkopf, B., Smola, A., Müller, K.R., Scholz, M. and Rätsch, G., 1999. Kernel PCA and de-noising in feature spaces. Advances in Neural Information Processing Systems, MIT Press, 11: 536–542.
* Morais, E.F., Campos, M.F.M., Padua, F.L.C. and Carceroni, R.L., 2005. Particle filter-based predictive tracking for robust fish counting. 18th IEEE Brazilian Symposium on Computer Graphics and Image Processing: 367–374.
* Palazzo, S., Kavasidis, I. and Spampinato, C., 2013. Covariance based modeling of underwater scenes for fish detection. Proceedings of IEEE International Conference on Image Processing, Melbourne, Australia. Paper 3591, 5 pages. Available at http://groups.inf.ed.ac.uk/f4k/PAPERS/ICIPcs13.pdf
* Shortis, M.R., Harvey, E.S. and Abdo, D.A., 2009. A review of underwater stereo-image measurement for marine biology and ecology applications. In Oceanography and Marine Biology: An Annual Review, Volume 47, Gibson, R.N., Atkinson, R.J.A. and Gordon, J.D.M. (Editors). CRC Press, Boca Raton FL, USA. ISBN 978-1-4200-9421-3. 342 pages.
* Shortis, M.R., Ravanbakhsh, M., Shafait, F., Harvey, E.S., Mian, A., Seager, J.W., Edgington, D., Cline, D. and Culverhouse, P., 2013. A review of techniques for the identification and measurement of fish in underwater stereo-video image sequences. Videometrics, Range Imaging, and Applications XII, SPIE Vol. 8791, paper 0G. The International Society for Optical Engineering, Bellingham WA, USA.
* Spampinato, C., Chen-Burger, Y.-H., Nadarajan, G. and Fisher, R., 2008. Detecting, Tracking and Counting Fish in Low Quality Unconstrained Underwater Videos, 2: 514–519.
* Tsai, A., Yezzi, A., Wells, W., Tempany, C., Tucker, D., Fan, A., Grimson, W.E. and Willsky, A., 2003. A shape-based approach to the segmentation of medical imagery using level sets. IEEE Transactions on Medical Imaging, 22(2): 137–154.
* Tsai, A., Yezzi, A., Wells, W., Tempany, C., Tucker, D., Fan, A., Grimson, W. and Willsky, A., 2001. Model-based curve evolution techniques for image segmentation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1: 463–468.
* Veltkamp, R. and Hagedoorn, M., 1999. State-of-the-art in shape matching. Technical Report UU-CS-1999-27, Utrecht University, Sept. 1999.
* Walther, D., Edgington, D. and Koch, C., 2004. Automated video analysis for oceanographic research. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition: 544–549.
* Zhou, J. and Clark, C., 2006. Autonomous fish tracking by ROV using monocular camera. The 3rd Canadian Conference on Computer and Robot Vision: 8 pages.