Deep Visual Features Matching Method for Vehicle Model Re-Identification Nikolay Nemcev1[0000−0003−4801−3284] nicknemcev@gmail.com and Elena Vasilenko1[0000−0002−1914−7753] apfelrobbe@gmail.com ITMO University, Saint Petersburg, 197101, Russian Federation www.ifmo.ru Abstract. In this paper, a study of the existing methods for identifying and comparing the features of objects used in the task of a vehicle model re-identification by its image was conducted, which is one of the most important tasks facing automated traffic control systems, and solved by comparing the vehicle features being verified with a certain set of features obtained by the monitoring system earlier, and deciding whether the compared samples belong to the same vehicle model or to different ones. The article describes the approach for vehicle model re-identification ac- cording to its image, based on the method of feature vector extraction, using classification convolutional neural network, and on the criterion for feature vectors matching via the sub-counting corresponding fea- tures. The proposed method shows a lower computational complexity than modern analogous approaches, uses smaller feature vector, demon- strates comparable re-identification accuracy in scenarios when the test- ing data have characteristics that coincide with training ones (similar camera model and level of lighting and noise, models of re-identifiable vehicles are contained in the dataset used for training) and achieves significantly higher relative accuracy in cases when testing data very dif- ferent from the training dataset. The proposed approach is practically applicable in vehicle re-identification task for highly loaded traffic control systems. Keywords: Visual data processing · machine learning· convolutional neural networks · feature extraction · feature comparison · Alexnet 1 Introduction The computer vision algorithms are used to solve various tasks in the automotive industry: from detecting vehicles, measuring its speed and counting its number for creating environmental analysis functions for autonomous moving devices. The task of re-identifying vehicles is one of the most important ones that are being solved with traffic control systems, and it may be expressed through se- lection of vehicle’s features (both each one and some specific group of vehicles of Copyright c 2019 for this paper by its authors. Use permitted under Creative Com- mons License Attribution 4.0 International (CC BY 4.0). 2 N. Nemcev et al. the same model) for its further comparison with some set of previously extracted features in order to determine the conformity of the samples. The task of re-identifying a car is related to the task of facial re-identifying, however, it also has its own specifics - so in the task of identifying a car it is necessary to ensure a reliable comparison of the car’s attributes regardless of the angle of view (front, side, back, at different angles), while face identification is usually done from one angle (generally in full face). Also, different car models can visually differ only from a certain angle (especially if car models from the same manufacturer are compared), and images of the same car, taken from different angles, may contain only few common details, which greatly complicates the task of re-identification [1]. Conditionally, all approaches for re-identifying objects can be divided into ones using classic feature extraction methods [2, 3], and ones based on the ap- proaches of feature extraction using convolutional neural networks [1, 4], mean- while neural network (based on the use of artificial neural networks) approaches can be divided into ones working according to the classical scheme in which features are determined for each compared image separately and its matching is placed in distinct module [5], and approaches based on the use of Siamese neural networks [6] in which two input images are processed in parallel within the same network, and the similarity metric is calculated directly on the last layer [1, 7]. Classical approaches of extracting and comparing features are hardly practi- cally applicable in the task of verifying a vehicle model due to the fact that in real use cases, comparisons are often made for vehicles taken at a considerable distance, images of which do not have a resolution high enough to highlight a significant number of features [8]; the procedure of comparing images of vehi- cles is being reduced to counting the number of matches between the received descriptors [9], what does not allow you to adjust the compared objects taken from different angles (sets of features obtained from different angles for a same vehicle model will differ). Neural network approaches for comparing of object in an image use classi- fication network architecture for extracting feature vectors of objects (f.e. [10, 11]) and process its further comparison in the separate module [5] or on the last network layer in case of siamese networks [6]. Main advantages of this approach include possibility of multi-angle comparison and absence of strict resolution requirements for the compared images. The main disadvantage is the feature extraction module’s necessity of training on corresponding data set, for exam- ple, for correct identification of characteristics of a particular car model classifier should be trained to classify the most complete set of car models, ideally con- taining model wanted to be verified, meanwhile the visual characteristics of used dataset should correspond to the real ones as much as possible (for reliable iden- tification of the car model in the night time, used classifier must be trained on a data set containing night photos)[12]. Beside that, the effect of “overfitting” is observed in all neural network ap- proaches, in which some model of the classifier, or of the verifier in this case, demonstrates good results on the training data set, but significantly loses accu- Method for vehicle model re-identification 3 racy on the data far different from the training [13]. Nevertheless, both feature set extraction and feature set comparison modules based on machine learning methods are prone to this effect [14], which considerably reduces the universality of such approaches. The approaches based on usage of siamese networks additionally to problems with “overfitting” have problems with network convergence at training stage caused by the heterogeneity of the input data [15], which complicates its usage in some cases. The proposed approach combines an approach for extracting of a short fea- ture vector based on the usage of a modified classification network Alexnet [10], trained on a specially prepared data set, and a simple similarity metric that operates on small feature vectors and based on the principle of estimating the number of matching features. The use of this metric is due both to the need to reduce the computational complexity of the task and to optimize the computa- tional process of re-identification of the vehicle model for use in highly loaded traffic control systems, and to reduce the ”overfitting” effect of the machine learning algorithms on stability the operation of a system that processes data significantly different in their characteristics from those used in the process of extracting features of the object training module [13]. The obtained results show that, despite its simplicity, the proposed approach demonstrates the accuracy of solving the problem of car model verification com- parable with other modern approaches with higher universality and less high computational complexity. 2 Feature extraction model The modified network Alexnet [10] was used as feature set extraction module in this paper. Instead of ReLU (Rectified Linear Unit) which can be calculated as: f (x) = max(0, x), (1) where x is the activation function input value, it was used RReLU (Randomized Rectified Linear Unit) [16] as the activation function:  x, if x ≥ 0, f (x) = , (2) α ∗ x, otherwise where α ⊂ U (l, u),l ¡ u, l, u ∈ [0, 1), U (l, u) are some uniform distribution (it was 2 used U (3, 8) distribution, and α = l+u at the testing stage). Classical ReLU can break down during training process [16], for example, large gradient passing through ReLU may lead to such an update of weights that the neuron will never be activated again [10]. If it happens, from this mo- ment gradient passing through this neuron will always be evaluated as 0 what negatively affects the effectiveness of classifier training. The researches presented in [16] showed that the usage of some ”loss” for x < 0 lets us both decrease the 4 N. Nemcev et al. possibility of neuron failing on training stage and some kind decrease ”overfit- ting” effect of the network due to the random nature of the parameter α. In order to accelerate the convergence of the network (reduce training time) and increase the stability of its operation it was used a standard approach at the training stage based on the addition of special normalizing layers (BacthNorm, batch normalization [17]). Network architecture Alexnet [10] was used because it’s investigated and has rather shallow depth (low count of hidden layers), alleviating process of classifier training and providing comparatively to other networks architectures low computing complication of feature extracting process. It was used StanfordCars [18] dataset train part, which contains 8,144 images of static cars of 196 different models (totally dataset contains 16,185 images), for classifier training. In purposes of increasing of classification accuracy, ”overfit- ting” effect decreasing and raising of classifier universality the source dataset was augmented by 10 times with additional data balancing was done [19] (was equal- ized amount of images of each car models up to 450 samples). The augmentation of the original images was carried out by using affine transformations, perspec- tive transformations, contrast changes, Gaussian noise, hue/saturation changes, cropping/padding, blurring. For each image set of used transformations was cho- sen randomly and parameters of each transformation was selected like values of uniform distribution with transformation-specific predefined range 1. Fig. 1. Example of a augmentation results Classification network receives on its input a relevant image to get an object’s features vector and extracts a set of coefficients after the last MaxPool layer represented as a value matrix sized by 256 × 6 × 6: P6 P6 (Ci,k,m − min(C)) Fi = k=1 m=1 , (3) max(C) − min(C) where Fi ∈ F is en element of a result feature vector containing 256 elements, Ci,k,m ∈ C is an element of coefficient matrix extracted from a network. Method for vehicle model re-identification 5 3 Feature vector similarity criterion Used feature vector similarity criterion receives on its input two object’s feature vectors F and F0 computes some similarity criterion S ∈ [0, 1]:  P256 J(Fi ,Fi0 ) P256 1 − Pi=1 , if i=1 I(Fi , Fi0 ) > 0,  256 S= i=1 0 I(Fi ,Fi ) , (4) 0.5, otherwise  where Fi ∈ F, Fi0 ∈ F0 , I(Fi , Fi0 ) and J(Fi , Fi0 ) are indicators computed as: 1, if Fi > thr or Fi0 > thr,  0 I(Fi , Fi ) = , (5) 0, otherwise I(Fi , Fi0 ), if |Fi − Fi0 | > 21 max(Fi , Fi0 ).  J(Fi , Fi0 ) = . (6) 0, otherwise 4 Assessment of the effectiveness of the proposed approach A comparative assessment of the effectiveness of the proposed criteria for the similarity of features is based on the evaluation of the Euclidean distance be- tween the compared vectors, an approach using the support vector method (SVM [14]), as well as an approach based on the use of siamese neural networks [6] and other modern analogues, both on a test subset of the data set StanfordCars used for training [18] and on examples from the data set CompCars, which contains images of moving vehicles was captured by surveillance cameras with large ap- pearance variations due to the varying conditions of light, weather, traffic, etc [5]. As the test data of the set StanfordCars [18], we used 5000 “positive” ex- amples consisting of images of cars of one model, and 5000 “negative” pairs of images consisting of images of cars of different models, randomly selected from a subset of images that were not used in the training process. As test data of the set CompCars we used the original structure of test data from three diffi- culty levels, containing at each difficulty level 20,000 compared pairs of images (10,000 ”positive” and ”negative” examples) [5]. Each image pair in the “easy set” is selected from the same viewpoint, while each pair in the “medium set” is selected from a pair of random viewpoints. Each negative pair in the “hard set” is chosen from the same car make. As a quality criterion, a re-identification accuracy metric was used, calculated as: TP + TN Accuracy = 100 ∗ , %, (7) N where T P is an amount of correctly recognized ”positive” image pairs (the pair contains images of vehicles of the same model, and the verifier evaluates them 6 N. Nemcev et al. as elements of one subset), T N is an amount of correctly recognized ”negative” pairs, N is an amount of compared images. Evaluation of the effectiveness of the proposed approach and the approach on the test subset of the training data set is given in Table 1, while for evaluation of the effectiveness of the proposed criterion and metric based on the evaluation of the Euclidean distance, it was used a threshold selected to ensure maximum efficiency (accuracy) of verification on the training data. Table 1. Estimation of the effectiveness of the proposed criteria for the similarity of features (StanfordCars). Approach Accuracy, (%) Proposed feature extraction method 65,5 + Euclidean distance Proposed feature extraction method 71,2 + SVM [4] Proposed feature extraction method 69,8 + proposed feature criteria for the similarity of features Siamese network (Triplet Loss) - Random selection 50 Comparison presented in the Table 1 allows us to conclude that during pro- cessing data as close as possible to training, the proposed criterion for comparing feature vectors is somewhat inferior to the comparison method updated on ma- chine learning techniques (in this case, the method of support methods was used) and surpasses the approach based on estimating the Euclidean distance between vectors. It should also be noted that the lack of results for the siamese network [6] is due to the fact that, despite the enumeration of training hyperparameters, it was not possible to ensure the convergence of the network on the Stanford- Cars data set, which is a known problem in the process of training networks with Siamese architecture on heterogeneous data [1]. A comparative analysis of the effectiveness of the proposed approach on the CompCars [5] data set is given in Table. 2. Basing on the analysis of the table. 2, we can draw the following conclusions: - the proposed metric of similarity of feature vectors allows you to maintain the relative accuracy of verification when switching to a data set significantly different from the training, while the approach for comparing features using the support vector method significantly loses accuracy due to the ”overfitting” effect; - the proposed vehicle model verification system demonstrates on a set of data significantly different from the training accuracy comparable to the accuracy of similar basic algorithms (GoogleNet + SVM [5]), trained on a subset of the data close to the test ones, however, it loses much to modern verification approaches (Mixed Diff + CCL [5]), the training of which was also carried out on a subset of data similar in their characteristics to the test ones. Method for vehicle model re-identification 7 Table 2. Evaluation of the effectiveness of the proposed criteria for the similarity of features (CompCars). Approach Training Dataset Accuracy, % Easy Medium Hard Proposed feature extraction method StanfordCars[18] 62,1 60,3 57,6 + Euclidean distance Proposed feature extraction method StanfordCars[18] 67,1 63,7 61,5 + SVM [4] Proposed feature extraction method StanfordCars[18] 72,4 70,1 66,8 + the proposed criteria for the similarity of features GoogleNet + SVM [5] CompCars[5] 70 69 65,9 Mixed Diff+CCL [1] CompCars[5] 83,3 78,8 70,3 Random select - 50 50 50 In addition, it should be noted that the proposed approach for verifying a car model operates with shorter feature vectors compared to GoogleNet + SVM [5] (256 for the proposed approach, 4096 for GoogleNet + SVM [5]) and does not require a long training process for a module of similarity defining. And also, unlike Mixed Diff + CCL [1], which belongs to the class of Siamese networks, it allows to save the feature vector separately for further comparison without computationally complex features extraction operations. It is important for the class of cyberphysical systems under consideration to ensure the probability of completing tasks in a given time if the proposed approach is used in real-time, which, as shown in [19–21], is achievable with redundant calculations. 5 Conclusion The proposed approach for selecting and comparing features of objects from their image is used in the task of verifying a vehicle model. Selecting is being occured by modifying of a known artificial neural network trained on a spe- cially prepared (augmented) data set. The proposed criterion for the similarity of feature vectors is based on comparison techniques and has extremely low com- putational complexity. The proposed vehicle verification method demonstrates accuracy comparable to similar modern methods in those use cases when the pro- cessed data have the same characteristics as the training ones (a similar camera model, similar level of lighting and noise, etc.), and demonstrates higher relative accuracy in processing data that are significantly distinguished from training. 8 N. Nemcev et al. References 1. Liu H. et al. Deep relative distance learning: Tell the difference between similar vehicles // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. P. 2167–2175. 2. Rublee E. et al. ORB: An efficient alternative to SIFT or SURF // ICCV. 2011. V. 11. N 1. P. 2. 3. Pan X., Lyu S. Region duplication detection using image feature matching // IEEE Transactions on Information Forensics and Security. 2010. V. 5. N 4. P. 857–867. 4. Zapletal D., Herout A. Vehicle re-identification for automatic video traffic surveil- lance // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2016. P. 25–31. 5. Yang L. et al. A large-scale car dataset for fine-grained categorization and verifi- cation // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. P. 3973–3981. 6. Koch G., Zemel R., Salakhutdinov R. Siamese neural networks for one-shot image recognition // ICML deep learning workshop. 2015. V. 2. 7. Cheng D. et al. Person re-identification by multi-channel parts-based cnn with im- proved triplet loss function // Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. P. 1335–1344. 8. Ke Y. et al. PCA-SIFT: A more distinctive representation for local image descriptors // CVPR (2). 2004. V. 4. P. 506–513. 9. Ng P.C., Henikoff S. SIFT: Predicting amino acid changes that affect protein func- tion // Nucleic acids research. 2003. V. 31. N 13. P. 3812–3814. 10. Krizhevsky A., Sutskever I., Hinton G.E. Imagenet classification with deep con- volutional neural networks // Advances in neural information processing systems. 2012. P. 1097–1105. 11. Szegedy C. et al. Going deeper with convolutions // Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. P. 1–9. 12. Tanner M.A. Tools for statistical inference: observed data and data augmentation methods. Springer Science and Business Media, 2012. V. 67. 13. John G.H., Kohavi R., Pfleger K. Irrelevant features and the subset selection prob- lem // Machine Learning Proceedings 1994. Morgan Kaufmann, 1994. P. 121–129. 14. Joachims T. Making large-scale SVM learning practical. Technical report, SFB 475: Komplexitätsreduktion in Multivariaten Datenstrukturen, Universität Dortmund, 1998. N 1998, 28. 15. Hoffer E., Ailon N. Deep metric learning using triplet network //International Workshop on Similarity-Based Pattern Recognition. Springer, Cham, 2015. P. 84–92. 16. Xu B. et al. Empirical evaluation of rectified activations in convolutional network // arXiv preprint arXiv:1505.00853. 2015. 17. Howard A.G. et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications // arXiv preprint arXiv:1704.04861. 2017. 18. Krause J. et al. 3d object representations for fine-grained categorization // Pro- ceedings of the IEEE International Conference on Computer Vision Workshops. 2013. P. 554–561. 19. A. V. Bogatyrev, V. A. Bogatyrev and S. V. Bogatyrev, ”Multipath Redundant Transmission with Packet Segmentation,” 2019 Wave Electronics and its Applica- tion in Information and Telecommunication Systems (WECONF), Saint-Petersburg, Russia, 2019, pp. 1-4. doi: 10.1109/WECONF.2019.8840643 Method for vehicle model re-identification 9 20. V. A. Bogatyrev, S. V. Bogatyrev and A. V. Bogatyrev, ”Model and Interaction Ef- ficiency of Comput-er Nodes Based on Transfer Reservation at Multipath Routing,” 2019 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF), Saint-Petersburg, Russia, 2019, pp. 1-4. doi: 10.1109/WE- CONF.2019.8840647 21. Bogatyrev A.V., Bogatyrev S.V., Bogatyrev V.A. Analysis of the Timeliness of Redundant Service in the System of the Parallel-Series Connection of Nodes with Unlimited Queues//2018 Wave Electronics and its Application in Information and Telecommunication Systems (WECONF), 2018, pp. 8604379