SVM classification of moving objects tracked by Kalman filter and Hungarian method Gábor Szűcs, Dávid Papp, Dániel Lovas Department of Telecommunications and Media Informatics, Budapest University of Technolo- gy and Economics, Magyar Tudósok krt. 2., H-1117, Budapest, Hungary, szucs@tmit.bme.hu, pappdavid27@gmail.com, lovas.daniel@simonyi.bme.hu Abstract. Fishery video data often require laborious visual analysis, therefore a video-based fish identification challenge is announced in the LifeCLEF cam- paign for automatic fish categorization and enumeration. We have elaborated a complex system to detect, classify and track objects (fishes) in underwater video by examining each image frame of it. For the detection process we used background subtraction and morphologic methods, and then our solution cal- culated bounding boxes based on object contours. We used Kalman filter to track the moving objects, but an additional matching method was required to pair the objects in consecutive time periods because of many fishes. We used Hungarian method for this matching problem. We categorized the detected fishes with C-SVC classifier, as an advanced SVM (Support Vector Machine) classifier. The classifier used high level descriptors, which are based on the ex- tracted SURF vectors in each object. For optimization the C-SVC classifier we conducted a preliminary test, and we used the best parameters for teaching the classifier. We predicted the fish species in the official test video set, and our predictions were evaluated officially by NCS (Normalized Counting Score). Keywords: SVM method, fish classification, tracking, Kalman filter, Hungarian method 1 Introduction The analysis of video data usually requires very time-consuming and expen- sive input by human observers, and this is true for underwater videos as well, although the statistics of data collection would be very useful for exploratory applications, in particular for fisheries and biological areas. This analytical "bottleneck" greatly restricts the use of the powerful video technologies and demands effective methods for automatic content analysis to enable proac- tive provision of analytical information; and in order to solve this problem a challenge is announced in the LifeCLEF [1] campaign of ImageCLEF [2]. In this challenge two datasets (training data set with ground truth and a test set) were released for the video-based fish identification task [3]. The goal was to automatically count fish per species in video segments (e.g., video X contains N1 instances of fish of species 1, ..., Nn instances of fish species n). We have divided the problem into subtasks: object detection, classification and tracking, where the objects were the fishes; and we have implemented a video analysis system to solve these tasks. The applied methods and our so- lution are described below. 2 Object detection, classification and tracking system Object classification and tracking are different tasks, but both of them based on object detection (fish detection) in images of videos. We have imple- mented fish detection in OpenCV in such way that the bounding boxes of detected fishes are stored as small images with corresponding information (actual timestamp, identifier, etc). The common subtask in both of the prob- lems is mapping, i.e. interconnection of bounding box images and fish identi- fiers (these identifiers are generated for only distinguishable aim), because the results of the mapping can be used for classification and tracking as well. Thus the bounding box images are input for classification method, which estimates the species of fishes. The consecutive images with common fish identifiers can be classified into different species; therefore the final decision of classification in our solution is based on majority voting. 2.1 Detection of many fishes The one of the challenges in the fish detection was the observation of many objects in an image. For object detection at first we have used background subtraction [4], in order to separate the foreground from background. The most common morphological methods (erosion, dilation and the combina- tion of them, the closing) have been applied in order to get smooth and solid edges [5]. After this smoothing an algorithm for contours of objects has been applied, that is evolved by Suzuki and Abe [6]. Using the contours our solu- tion calculates the bounding boxes and the object centres. Some of detected objects were too small to substantively use in classification, hence those ob- jects that were smaller then 15x15 pixel were filtered out (deleted). 2.2 Fish tracking with Kalman filter and Hungarian method After the detection in our solution Kalman filter [7][8] has been used to track objects in three steps: (i) initialization, and after that there is a cycle process with (ii) prediction and (iii) correction. Initialization: at the first frame, where any detection was, the Kalman filter was initialized; and for every detected object an identity number and a con- fidence value (CFV) were attached. Prediction: In this step a prediction was made by Kalman filter on each de- tected object (using the calculated object centre) to forecast the future posi- tion of the investigated fish. Correction: After prediction the new detections (in next frame of the video) give the measurements (which are used in the comparison of the measure- ments with predictions). These measurements were used for correct the Kalman filter objects. In order to reach the best tracked-measured coupling we applied the Hungarian method [9][10], completed with a restriction that we removed those objects that not belong to a new measurement. We pre- sent this applied method below. Let vi be a tracked object, where i = 0...k and let wj be a newly detected ob- ject where j = 0...l. Before our solution used the Hungarian method a Mkxl matrix was calculated, where M[i,j] denotes the Euclidean distance (meas- ured in pixels in the image) of vi and wj objects. The rows and columns with higher elements than a given threshold were removed to prevent false matching. If a row corresponding to vi was removed from M, then we low- ered the confidence value (CFV) of that particular tracking. On the other side removing wj means that we should track this object, i.e. probably this is a new object (a new Kalman filter has been created for this object), because this is far from all others. At this point the Hungarian method were performed on M which resulted the optimal object tracking (v,w) pairs, i.e. minimal sum of distances. Fig. 1. Mechanism of Hungarian method The Fig. 1. illustrates the mechanism of the applied Hungarian method, where the vi objects are representing by a set of black and grey vertices (T) and wj object by red and pink ones (N). The thick edges connect the matched (v,w) pairs, whereas grey and pink colour denote those nodes that have no pair. After the Hungarian method the matched object pairs will be the input of the correction phase of Kalman filter (black vertices are tracked objects and the red ones are the measurements). 2.3 Fish classification 2.3.1 Elaboration of image descriptors The first part of the classification process is the representation of each image based on their visual content. This consists of three steps: (i) feature detec- tion, (ii) feature description, (iii) image description as usual phases in com- puter vision. Feature detection and description: Lots of different feature types can be de- tected in an image, e.g. corners, edges, ridges, as “interesting” part of an image. In our solution we have used Fast-Hessian Detector [11] to determine the “key points” in each image, and SURF (Speed Up Robust Features) [11] descriptor to extract local information at each key point. The SURF is based on Lowe’s SIFT (Scale Invariant Feature Transform) [12][13] with the expecta- tion to be faster and more robust. Both of these methods are widely used in practice and in theoretical works (as well) with some possible further devel- opment; but we have chosen SURF, because in videos the processing speed is more important. A SURF descriptor vector belongs to only one interesting point of an image, but an image possesses many feature descriptor vectors, which should be aggregated into an image descriptor. Image description: The next step of creating the representation is the com- pletion of high-level representation of each image. We have applied BoW (bag-of-words) model [14][15] for this purpose, where images are treated as text documents. According to this, “visual words” (so called “codewords”) in images need to be defined from feature descriptors. The whole set of code- words gives the codebook (similarly to dictionary in text tasks). To determine the codebook we clustered the SURF descriptors with K-means [16] algo- rithm, and the resulting cluster centers will represent the codewords, since a centroid represents similar feature descriptors. We have experimented with different cluster sizes (k=5000 and k=10000), these are discussed later. At this point, the codebook with codewords was available for further calcula- tions, which can be considered as a concise representation of the whole im- age set. According to the codebook the next step is to create a descriptor that specifies the distribution of the visual codewords in any image, called high-level descriptor. We have built histograms as high-level descriptors for each image:  Let ( ) ( ) ( ) be the initial histogram of the rth image, where denotes the size of the codebook (each element represent a codeword in H).  We performed 1NN (1-nearest neighbour) algorithm for each SURF descriptor to find the closest codeword (based on Euclidean distance), then the corresponding element of H was incremented by 1, where i cycles through the descriptors created for the rth image 2.3.2 Training the classifier For the classification task we have divided the labelled image set into two subsets: training and test set. We used the first one to create the codebook, and train the classifiers, and the latter for preliminary testing. The histo- grams were created for each training image, then we performed a variation of SVM (Support Vector Machine), the C-SVC (C-support vector classification) [17][18] with linear kernel function to train the classifier (classification mod- el). The SVM is basically a binary linear classifier, thus in order to extend it to a number of classified categories, the one-against-all technique was used. During this method a binary classifier was created for each category in the training set. We have executed kFold cross-validation technique before the preliminary test to determine the parameter of the C-SVC classifier. (After the training, the codebook was already available.) 2.4 Preliminary test of classification For the maximization the goodness of classification part we have conducted a preliminary test, as validation phase of the learning process. The labelled image set was divided into train set (11221 images) and preliminary test set (11220 images). The training phase was executed by different parameters according to the number of codewords sizes (5000 and 10000), the number of dimensions of SURF vectors (64 and 128), and the number of SURF vectors in an image (80, 200, and 500). Besides the accuracy we have measured the speed of the algorithm as well, and the results can be seen in Table 1 and 2. Table 1. Results of the preliminary testing at 5000 codewords number of dimensions: 128 number of dimensions: 64 run time (s) accuracy run time (s) accuracy vectors: 80 N.A. 0.259 N.A. 0.253 vectors: 200 9744 0.644 8672 0.595 vectors: 500 11863 0.663 9406 0.629 After the preliminary testing at 5000 codewords the case of 10000 code- words was testing with only 200 and 500 SURF vectors, because 80 vectors has resulted in very poor accuracy. Table 2. Results of the preliminary testing at 10000 codewords number of dimensions: 128 number of dimensions: 64 run time (s) accuracy run time (s) accuracy vectors: 200 12532 0.611 11179 0.227 vectors: 500 15791 0.625 13927 0.381 The best result of the preliminary testing is the 0.663 at case of 128 dimen- sions, 500 vectors and 5000 codewords. The run time of the training phase was long, but the largest part (92-94%) of this was the SVM teaching, which was important for good accuracy, while the other parts are fast (creating codewords: 3-6%, histogram calculation: less, than 1%); furthermore the test phase was also quick (10 minutes). The run time values were measured on a PC (with Intel Core i7-4770K processor, 16 GB RAM and SSD). For the prediction of official test we have chosen the SVM model with the best parameters (128 dimensions, 500 vectors and 5000 codewords), and we have submitted 2 prediction files (as results of 2 runs). The used method of the runs was the same (as described above), only small difference was be- tween the two submissions. The first run (BME TMIT RUN1) contains false de- tections, because of the glances (blinks) and the low level threshold in detec- tion (higher level threshold could avoid these); while at the second run (BME TMIT RUN2) these detections were filtered out, and corresponding predic- tions in the crucial videos were deleted. 3 Evaluation 3.1 Evaluation metrics In the official evaluation the normalised counting score is measured (instead of accuracy as in our preliminary testing). The counting score (CS) is defined as can be seen in Equation (1), where d is the difference between the num- ber of occurrences in the run (per species) and the number of occurrences in the ground truth (Ngt). d  CS  e Ngt (1) The precision (Pr) is defined as Pr= TP/(TP+FP) with TP and FP being, respec- tively, the true positives and the false positives. The normalised counting score (NCS) is defined as NCS = CS x Pr. 3.2 Final official results Our final official results for each fish species can be seen in Table 3., where the occurrences of fish species in the test set and the NCS (Normalized Counting Score) results are presented. Table 3. Occurrences and NCS (Normalized Counting Score) results at fish species Fish species ID Occurrences in BME TMIT BME TMIT (identifier) the test set RUN1 RUN2 1 93 0.00 0.05 2 129 0.03 0.12 3 517 0.21 0.29 4 1876 0.07 0.19 5 0 1.00 1.00 6 1317 0.42 0.58 7 24 0.08 0.19 8 1985 0.43 0.59 9 5016 0.02 0.05 10 0 0.85 0.85 11 118 0.02 0.09 12 1531 0.34 0.38 13 0 0.99 1.00 14 700 0.00 0.03 15 187 0.83 0.86 In cases, there were no occurrences of a given species and in the runs no fish were observed of those species, CS = 1 (since d = 0) and Pr was equal to 1. It can be seen in Table 3 that our solution has perfectly found this absence at two species (among three fish species: 5, 10, 13). The aggregated official results of different fish species can be seen in Table 4. (the last column is the product of previous ones), and we can conclude that the second run of our submissions was better. Table 4. Our final official results Run identifier Counting score Precision Normalized Counting Score BME TMIT RUN1 0.62 0.44 0.27 BME TMIT RUN2 0.67 0.51 0.34 4 Conclusion We have elaborated a complex system to detect and track objects (fish) in underwater video by examining each image frame of it. For the detection process we used background subtraction and morphologic methods, and then our solution calculated bounding boxes based on object contours. The first time we detect an object, we assign a Kalman filter to it, which is able to predict the possible location of that object in the next frame. We likely de- tect that particular object (the same fish) in the next frame also, but we should not apply a new Kalman filter. To deal with this, we used Hungarian method to pair the existing Kalman filters with the new “candidate” ones. Then we erased the candidate Kalman filters with matching pair, and kept the single ones. This way we were able to track the detected objects. We also categorized the detected fishes with C-SVC classifier; however this required representing the objects based on visual information. For this purpose, our system calculated SURF descriptors for each object, and then clustered them with K-means algorithm. Our solution built histograms for each fish based on the resulting cluster centres (according to BoW model), and these histograms were the input of the C-SVC. For optimization the C-SVC classifier we con- ducted a preliminary test, and we used the best parameters for teaching the classifier. We predicted the fish species in the official test video set based on our implemented model, and the number of occurrences of each fish species was enumerated. At the official evaluation (by the second submission) we reached 0.34 value of NCS (Normalized Counting Score). References 1. Joly, A., Müller, H., Goeau, H., Glotin, H, Spampinato, C., Rauber, A., Bonnet, P., Vellinga, W. P., Fisher, B.: LifeCLEF 2015: multimedia life species identification challenges, Proceed- ings of CLEF 2015 (2015) 2. Villegas, M., Müller, H., Gilbert, A., Piras, L., Wang, J., Mikolajczyk, K., Herrera, A. G. S., Bromuri, S., Amin, M. A., Mohammed, M. K., Acar, B., Uskudarli, S., Marvasti, N. B., Al- dana, J. F., García, M. M. R.: General Overview of ImageCLEF at CLEF2015 Labs. Lecture Notes in Computer Science, Springer (2015) 3. Spampinato, C., Fisher, B and Boom, B.: LifeCLEF Fish Identification Task 2015, CLEF working notes 2015 (2015) 4. KaewTraKulPong P. and Bowden, R.: An Improved Adaptive Background Mixture Model for Real- time Tracking with Shadow Detection, In Proc. 2nd European Workshop on Ad- vanced Video Based Surveillance Systems, AVBS01. Sept (2001 ) 5. Fisher, R., Perkins, S., Walker, A., Wolfart, E.: Mathematical Morphology, (2003) http://homepages.inf.ed.ac.uk/rbf/HIPR2/matmorph.htm 6. Suzuki, S. and Abe, K.,: Topological Structural Analysis of Digitized Binary Images by Bor- der Following. Computer Vision, Graphics, and Image Processing, 30(1), 32-46. (1985) 7. Kalman, R. E.: A New Approach to Linear Filtering and Prediction Problems. Journal of Basic Engineering, 82(1), 35-45. (1960) 8. Welch, G. F.: Kalman Filter. Computer Vision: A Reference Guide, 435-437. (2014) 9. Kuhn, H. W.: The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1‐2), 83-97. (1955) 10. Frank, A.: On Kuhn's Hungarian method—a tribute from Hungary. Naval Research Logis- tics (NRL), 52(1), 2-5. (2005) th 11. Bay, H. and Tuytelaars, T. and Van Gool, L.: SURF: Speeded Up Robust Features, 9 Euro- pean Conference on Computer Vision, (2006) 12. Lowe, D.: Object recognition from local scale-invariant features. In: ICCV (1999) 13. Lowe, D. G.: Distinctive Image Features from Scale-Invariant Keypoints, International Journal of Computer Vision, 60, 2, pp. 91-110, (2004) 14. Fei-Fei, L., Fergus, R., & A. Torralba, A.: Recognizing and Learning Object Categories, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), (2007) 15. Lazebnik, S., Schmid, C. and Ponce, J.: Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition, New York, Vol. 2, pp. 2169-2178 (2006) 16. MacQueen, J.: Some methods for classification and analysis of multivariate observations, Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, pp. 281-297 (1967) 17. Boser, B., Guyon, I., Vapnik, V.: A Training Algorithm for Optimal Margin Classifier, Proc. of the 5th Annual ACM Workshop on Computational Learning Theory, pp. 144-152 (1992) 18. Cortes, C., Vapnik, V.: Support-vector networks, Machine Learning, Vol. 20, No. 3, pp. 273-297 (1995)