=Paper=
{{Paper
|id=Vol-3027/paper107
|storemode=property
|title=Intelligent Image Labeling System for Recognizing Traffic Violations
|pdfUrl=https://ceur-ws.org/Vol-3027/paper107.pdf
|volume=Vol-3027
|authors=Dmitriy Titarev,Dmitriy Korostelyov,Valentin Titarev,Dmitriy Kopeliovich
}}
==Intelligent Image Labeling System for Recognizing Traffic Violations==
Dmitriy Titarev 1, Dmitriy Korostelyov 1, Valentin Titarev 2 and Dmitriy Kopeliovich 1

1 Bryansk State Technical University, 7, 50 let Oktyabrya blvd., Bryansk, 241035, Russian Federation
2 Bryansk City Lyceum No. 2, 6, 22 Congress of the CPSU st., Bryansk, 241035, Russian Federation

Abstract

The article examines the problem of traffic violations and the changes to a city, including its road infrastructure, that can be made on the basis of their analysis. A conclusion is drawn about the applicability of machine learning methods to the preparation and labeling of images for solving this problem. The article describes in detail the algorithms for automatic labeling of images for recognizing traffic violations with the aim of creating a comfortable urban environment. Existing information systems that solve this problem are analyzed, with their strengths and weaknesses indicated. An intelligent system developed by the authors, combining manual and automatic object recognition, is described, along with the development tools and libraries used. The experimental part contains the results of testing the system, including neural network training. Information is given on the number of images and objects in them, as well as on the percentage of correctly detected objects in automatic image labeling.

Keywords

Automatic image labeling, intelligent system, neural networks, OpenCV, traffic violations, YOLO convolutional neural network, comfortable urban environment.

1. Introduction

The development of artificial intelligence technologies and algorithms opens up new opportunities for their effective use. One of the most in-demand and promising areas of artificial intelligence is the classification and identification of problem situations based on the analysis of images extracted from a video stream. The use of computer vision in transport deserves separate attention. Violations of traffic rules (hereinafter TR) are already recorded automatically with respect to speed limits, driving through a red light, stopping in prohibited places, and crossing the stop line [1, 2]. The consumer of these systems' data is the traffic police, and the data serve as the basis for issuing fines. However, statistics on systematic TR violations can also become the basis for creating a comfortable urban environment; this project is one of the priorities in our country and is supported at the level of the Government of the Russian Federation. For example, suppose pedestrians regularly cross the road in the wrong place. Analysis of the situation may show that they do so because there is a public transport stop opposite and the nearest pedestrian crossing is 500 meters away, so pedestrians are forced to violate TR. Within the framework of forming a comfortable urban environment, the solution may be to move the existing pedestrian crossing or create a new one.

For artificial intelligence systems to cope successfully with such tasks, methods based on machine learning are usually used.
One of the main stages of machine learning is the preparation and labeling of the images on which training is subsequently carried out. This stage can take up a significant part of the entire machine learning effort (tens of percent of the project) [3].

Currently, there are three main approaches to labeling and classifying images [4, 5]: manual, automated, and automatic. In manual labeling, the following methods are distinguished: in-house labeling by analysts, outsourcing (engaging a third-party team of analysts), and crowdsourcing (engaging individual specialists through specialized platforms). In automatic labeling, a synthetic method is distinguished (generating new data with given attributes using generative adversarial networks, GAN) and a software method (using automatic labeling systems). Automated methods combine software labeling with its subsequent correction and verification by experts.

The advantage of manual methods over software ones usually lies in the quality of the labeling, while software methods can significantly increase its speed. A logical consequence of these features is the combination of manual and automatic methods and the transition to automated labeling, i.e., the use of automatic methods for the primary labeling of images followed by correction by experts. The successful application of this approach depends on a convenient and flexible tool that allows the expert to quickly correct the automatically generated labeling.

Currently, the most common image labeling systems are [6]: MakeSense.ai [7], LabelImg [8], VGG Image Annotator [9], LabelMe [10], Scalabel [11], and RectLabel [12]. Some of these systems support automatic primary image labeling using predefined methods and models. This certainly increases the speed of labeling but does not allow the use of more accurate task-specific models that would show better results of automatic image labeling.

For image labeling, the specialized formats COCO, Pascal VOC, and YOLO are usually used [4]. They are quite capable and, to varying degrees, support the following labeling types [4]: bounding boxes, polygonal segmentation, semantic segmentation, 3D cuboids, key points and landmarks, lines and splines. The COCO format additionally allows labeling for object detection, keypoint detection, stuff segmentation, panoptic segmentation, and image captioning. However, not all of these formats are fully supported by all of the above systems, which somewhat limits their applicability.
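To make the formats discussion concrete, here is what labeling in the YOLO (Darknet) format looks like: one plain-text file per image, one line per object, each line consisting of an integer class index followed by the box center coordinates and box size, all normalized to the image width and height. The class indices refer to a separate list of class names, and the values below are purely illustrative:

```
2 0.481 0.634 0.210 0.185
0 0.115 0.720 0.055 0.160
9 0.902 0.310 0.064 0.118
```

By contrast, COCO stores all annotations for a dataset in a single JSON file, and Pascal VOC uses one XML file per image, which is one reason tool support for the formats varies.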
Let us consider in more detail the functionality of the existing image labeling systems and the limitations on their use for this problem. One of the main criteria is the ability to use images already labeled with the help of a neural network or experts.

MakeSense.AI. One advantage of this solution is the ability to use an already labeled image as an input parameter. The use of neural networks is limited to only two options, React and React duo, which narrows the software's area of application. Another disadvantage is the low accuracy of object detection.

LabelImg. This solution provides the expert with a wide range of tools for selecting objects for further neural network training and shows good detection accuracy. At the same time, regardless of the complexity of the scene, the speed of object detection is low. The solution does not allow choosing a neural network for training or loading already labeled images.

VGG Image Annotator. Specialists using this application can recognize objects not only in images but also in video sequences, which significantly broadens its scope of application. It also offers convenient tools for selecting objects. The solution is browser-based, which directly affects the speed of object detection. There is no choice of neural network and no ability to upload labeled images.

LabelMe. The application shows high speed and good object detection accuracy for scenes of varying complexity. The expert can choose a neural network, or combine several neural networks to improve detection accuracy. The negative aspects are high hardware requirements and the considerable time needed to read the instructions and master the complex interface. In addition, there is no way to upload already labeled images.

Scalabel. This application is designed for the Android mobile platform and supports real-time object detection. The chosen implementation affects the accuracy and speed of its operation, especially for complex scenes. The application lacks the ability to select objects for further neural network training; the choice of a neural network and the ability to upload labeled images are also missing.

RectLabel. The solution shows high detection accuracy and provides the expert with convenient tools for working with objects. At the same time, there is no possibility of choosing a neural network or of uploading already labeled images for further work.

In addition to those listed above, we can highlight systems that allow automatic image annotation: V7 Darwin, Deepen.AI, Heartex, Alegion Control, and Hasty.ai. Let us consider their strengths and weaknesses.

V7 Darwin. The positive aspects of this solution include high speed and accuracy of object detection; in addition to the automatic annotation mode, there are tools for manual image labeling and a large number of supported dataset formats. Among the shortcomings are possible problems for users with AMD devices and high hardware requirements [13].

Deepen.AI. This platform is one of the few offering work with 3D LiDAR, providing the user with a wide range of tools. Deepen.AI can merge sensor readings (for example, LiDAR) with photo and video series. The disadvantages of this solution include high requirements for user qualifications, including knowledge of and experience with 3D editors [14].

Heartex. This solution is open source, works with various file types, and is quite easy to learn, but the functionality of the free version is severely limited [15].

Alegion Control. This solution supports video in 4K resolution, including 3D annotation of objects, while imposing very high demands on the hardware [16].
Hasty.ai. The positive aspects of this software include high speed and accuracy, an easy-to-learn interface, and support for vector and pixel annotations. However, it does not allow the user to select or load another neural network [17].

These circumstances indicate the relevance of developing flexible intelligent image labeling systems that allow various models to be used for automatic primary labeling and that support various labeling formats. A description of the algorithms and software implementation of one such intelligent system, namely an image labeling system for recognizing traffic violations, is given below.

2. Analysis of algorithms for automatic image labeling for recognizing traffic violations

Automatic image labeling for recognizing traffic violations is based on algorithms for recognizing traffic objects. The analysis of the image flow usually consists of the following stages: image acquisition, automatic detection of objects using a neural network, obtaining an array of bounding boxes, and calculating the characteristics of the objects. Cameras installed at intersections, near roads, traffic lights, pedestrian crossings, etc. transmit an image or video stream to the intelligent system. The stream is then analyzed by a neural network, so that each object is detected and highlighted with a bounding box. The detected objects are passed to the traffic violation analysis subsystem.

The object recognition algorithm determines all objects present in the received image stream, together with the probability of their presence, and outputs the corresponding labels. For example, the algorithm may determine the presence of a "car" object in the image with a probability of 0.98. Recognition algorithms not only determine which objects are present in the image but also highlight their boundaries as frames of a certain height and width. This type of labeling is supported by all the formats mentioned above. To detect an object, areas of the analyzed image are selected and a recognition algorithm is applied to them. The location of an object is determined by the image areas where the probability of finding it is high. Among the first object detection algorithms were feature-based algorithms: the HOG algorithm and Haar cascades [18].

2.1. Histogram of Oriented Gradients (HOG algorithm)

The HOG algorithm (Histogram of Oriented Gradients) is based on the assertion that the shape and appearance of an object in an image can be accurately described by the distribution of pixel intensity gradients in the corresponding part of the image. Here the gradient is an approximation, by some function, of the intensity or brightness, whose values are known only at the pixels [19]. The HOG algorithm first applies a filter to the color and brightness components:

$[-1\; 0\; 1]$ and $[-1\; 0\; 1]^{T}$ (1)

The term HOG descriptor is directly related to the concept of a block: a rectangular area of image pixels of a specified size. The block is the main component of the HOG descriptor; it is a collection of cells, which in turn are collections of pixels. The next step is block normalization, for example by the L2 norm:

$f = \dfrac{v}{\sqrt{\|v\|_2^2 + e^2}}$ (2)

where $v$ is the unnormalized descriptor vector of the block and $e$ is a small constant.
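Since the system described later is implemented in C++ with OpenCV, a minimal sketch of HOG-based detection using OpenCV's built-in `cv::HOGDescriptor` may help make the idea concrete. This is not the authors' code; the image file name is hypothetical, and the pre-trained pedestrian detector shipped with OpenCV is used for illustration:

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>

int main() {
    // Load a street scene (hypothetical file name).
    cv::Mat img = cv::imread("street.jpg");
    if (img.empty()) return 1;

    // HOG descriptor with the pre-trained linear SVM coefficients
    // for pedestrians that ship with OpenCV.
    cv::HOGDescriptor hog;
    hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

    // Sliding-window detection over multiple scales; each hit is a
    // bounding box whose HOG descriptor scores high for "person".
    std::vector<cv::Rect> found;
    hog.detectMultiScale(img, found, 0 /*hitThreshold*/,
                         cv::Size(8, 8) /*window stride*/,
                         cv::Size() /*padding*/, 1.05 /*scale step*/);

    for (const cv::Rect& r : found)
        std::cout << "person at " << r.x << "," << r.y
                  << " size " << r.width << "x" << r.height << "\n";
    return 0;
}
```

Note that this detector handles a single object class; adding further classes requires training and running a separate descriptor and classifier per class, which is exactly the limitation discussed next.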
The HOG algorithm has several disadvantages: it works well for one or two object classes, but its efficiency drops sharply when objects of a new type are added; its speed is low; and its accuracy depends strongly on the type of object being recognized.

2.2. Haar Cascades

Haar cascades are built on digital image features (Haar features) used in pattern recognition; algorithms that work directly with image intensities have high computational complexity, and Haar features were used in the first real-time face detector. A Haar feature is composed of contiguous rectangular regions. The regions are positioned on the image, the pixel intensities within each region are summed, and the difference of the sums is calculated. This difference is the value of the feature of a given size, located in a certain way on the image. Consider images of human faces: a common property is that the area around the cheeks is lighter than the area around the eyes. Thus, a common Haar feature for faces consists of two adjacent rectangular regions lying on the cheeks and the eyes.

The main disadvantage of this algorithm is that Haar cascades require a large number of features to describe an object accurately, since individual features are poorly suited to training and classification on their own. The main advantage of the Haar cascade method over its analogs is speed, although the speed decreases significantly as the number of object types grows.
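For comparison with the HOG sketch above, here is a minimal, hedged example of running a pre-trained Haar cascade with OpenCV's `cv::CascadeClassifier`. The cascade file is one of the face cascades distributed with OpenCV (the local path may differ), and the image name is hypothetical:

```cpp
#include <opencv2/objdetect.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/imgproc.hpp>
#include <iostream>

int main() {
    // Pre-trained cascade distributed with OpenCV (path may differ).
    cv::CascadeClassifier cascade;
    if (!cascade.load("haarcascade_frontalface_default.xml")) return 1;

    cv::Mat img = cv::imread("street.jpg");
    if (img.empty()) return 1;

    // Cascades operate on grayscale intensities.
    cv::Mat gray;
    cv::cvtColor(img, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);

    // Each detection is a region that passed every stage of the cascade
    // of weak rectangular-feature classifiers.
    std::vector<cv::Rect> faces;
    cascade.detectMultiScale(gray, faces, 1.1 /*scale step*/, 3 /*minNeighbors*/);

    std::cout << faces.size() << " face(s) detected\n";
    return 0;
}
```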
2.3. Convolutional neural networks

Currently, most problems in the field of computer vision are solved using convolutional neural networks. One of the first convolutional architectures for object detection was R-CNN [20], developed by a UC Berkeley team. To improve performance, the neural network was fed not the complete image but regions most likely to contain objects; the regions were prepared in advance by a separate algorithm, and CaffeNet was used as the convolutional neural network. The process of detecting objects with R-CNN can be divided into the following steps: determining image regions using the Selective Search algorithm; scaling each region to the input size of CaffeNet; obtaining a feature vector for the region with the neural network; classifying each feature vector with an SVM; and linearly regressing the parameters of the region frame for more accurate localization of the object.

A convolutional neural network consists of several layer types, including convolution and activation layers. A convolution layer processes the previous layer fragment by fragment, and the coefficients of the convolution kernel are determined during training. The convolution results are passed through a nonlinear activation function, typically the rectified linear unit (ReLU):

$f(x) = \max(0, x)$ (3)

Using ReLU significantly increases the learning rate compared with previously used nonlinearities such as sigmoids. R-CNN is faster and more accurate than the methods described earlier, but it places high demands on disk space for storing a large number of features. In addition, its weak link is the selection of candidate regions before the main algorithm starts.

2.4. You Only Look Once

You Only Look Once (hereinafter YOLO) is a family of convolutional neural networks designed to recognize multiple objects in an image, and it is currently one of the most popular architectures. Unlike the algorithms above, YOLO applies the neural network once to the entire image: a grid is superimposed on the image, and the network predicts bounding boxes and the probabilities that the objects being detected are located in them. YOLO has higher speed and object identification accuracy than R-CNN [21]. This is possible because YOLO unifies and combines all the components necessary for detecting objects and takes the context into account when processing incoming images. With YOLO, objects can be recognized in real time, and it can run not only on high-powered servers and portable equipment but also on mobile devices, which greatly expands its scope of application.
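As a rough illustration of how YOLOv3 inference can be wired up in the same C++/OpenCV stack used by the authors (the paper does not publish its code, so this is a hedged sketch; the image name is hypothetical and the config/weights are the standard Darknet release files), OpenCV's `dnn` module can load Darknet models directly:

```cpp
#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>
#include <opencv2/imgcodecs.hpp>
#include <iostream>

int main() {
    // Standard Darknet config and weights for YOLOv3 (local paths).
    cv::dnn::Net net =
        cv::dnn::readNetFromDarknet("yolov3.cfg", "yolov3.weights");

    cv::Mat img = cv::imread("intersection.jpg");
    if (img.empty()) return 1;

    // YOLOv3 expects a square input with pixels scaled to [0, 1], RGB order.
    cv::Mat blob = cv::dnn::blobFromImage(img, 1.0 / 255.0, cv::Size(416, 416),
                                          cv::Scalar(), /*swapRB=*/true,
                                          /*crop=*/false);
    net.setInput(blob);

    // One forward pass over the whole image; YOLOv3 has several output layers.
    std::vector<cv::Mat> outs;
    net.forward(outs, net.getUnconnectedOutLayersNames());

    // Each output row: [cx, cy, w, h, objectness, per-class scores...],
    // with box coordinates relative to the image size.
    for (const cv::Mat& out : outs) {
        for (int i = 0; i < out.rows; ++i) {
            const float* row = out.ptr<float>(i);
            cv::Mat scores = out.row(i).colRange(5, out.cols);
            cv::Point classId;
            double confidence;
            cv::minMaxLoc(scores, nullptr, &confidence, nullptr, &classId);
            if (confidence > 0.5)
                std::cout << "class " << classId.x << " p=" << confidence
                          << " at (" << row[0] * img.cols << ", "
                          << row[1] * img.rows << ")\n";
        }
    }
    return 0;
}
```

In practice one would also apply non-maximum suppression (for example, `cv::dnn::NMSBoxes`) to merge overlapping boxes before using the detections.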
3. Description of the software implementation of the intelligent system

In the developed intelligent system, the YOLO convolutional neural network was used, since it has proven itself best for this class of problems: when detecting traffic violations, a large number of objects of different types can be present in the image. The software package was developed in the C++ programming language using the OpenCV library. The system interface is shown in Figure 1.

One of the features of the developed intelligent system is the ability to combine manual and automated labeling (Figure 2). Various automatic labeling algorithms can be selected and plugged in as external modules, and the resulting labeling file can itself be fed back to the system input, which allows specific types of objects to be determined by the algorithms that detect them most efficiently. Another distinctive feature of the system is the ability not only to highlight objects in images but also to assign each image to a specific class. For example, one can indicate that an image contains no signs of traffic violations, or that it contains signs of a specific violation (people crossing the road in the wrong place, improper parking, etc.). This significantly expands the possibilities of the developed intelligent system, since not only machine learning methods can be applied to the resulting labeling, but also other methods of intelligent analysis (for example, classification using clustering or decision trees).

Figure 1: Intelligent system interface
Figure 2: Scheme of interaction between an expert and an intelligent system

4. Experiment Description

For the experiments, images pre-selected from the Internet were supplemented with a collection of images taken by the authors on the streets of Bryansk. Using ready-made trained neural networks showed a fairly significant number of defects in automatic labeling. Figures 3-5 show examples of the detection of traffic objects of various types using the YOLO v3 algorithm on the authors' own images.

Figure 3: Traffic objects detection at an intersection
Figure 4: Detection of road traffic objects
Figure 5: An example of traffic objects detection

As can be seen from these figures, the recognition quality is not always ideal; therefore, at the first stage of the experiments, the task of training a specialized neural network was set. A supervised training approach was used. Training the YOLO convolutional neural network requires a large number of photographs containing objects of various types, so both ready-made images from the Caltech Pedestrian Dataset and photographs taken on the streets of Bryansk were used. In Figure 6, the following objects are visible: a car and road signs (a speed limit; stopping and parking prohibited; and a plate indicating how long the restriction is valid).

Figure 6: Image labeling for neural network training

Knowing the correct result, we can train our neural network. Figure 7 shows a more complex composition of objects in the image: road signs, cars, and pedestrians.

Figure 7: Image labeling for training a neural network on a complex composition

As can be seen from Figure 7, further training of the neural network is required, since not all objects were detected: the white minibus at the stop was not detected and was not framed. The intermediate results obtained showed an improvement in the quality of automatic labeling of objects in the image. Table 1 shows the neural network training metrics; it can be seen that the number of object detection errors decreased by 6% after additional training of the neural network.

Table 1: Neural network training metrics
- Number of images processed in the first training iteration: 825
- Number of objects in the images to be detected: 4847
- Number of incorrectly detected objects at the first training iteration: 963
- Mean average precision at the first training iteration: 0.985
- Mean average recall at the first training iteration: 0.474
- Number of images processed after manual labeling of objects by an expert and additional training of the neural network: 758
- Number of objects in the images to be detected in the second iteration: 4296
- Number of incorrectly detected objects in the second iteration: 604
- Mean average precision in the second iteration: 1
- Mean average recall in the second iteration: 0.715

Further, a number of rules for the automatic classification of the image type (determining the presence and type of traffic violations) were developed and integrated into the intelligent system. Depending on the presence of objects of different types in one segment of the image, the following classes of rules were distinguished:

1. Objects of different types are present in one segment of the image. For example:
- Pedestrian crossing sign, car, people. Possible violation: the driver did not give way to pedestrians at the pedestrian crossing.
- Cars, people. Possible violation: pedestrians cross the road in the wrong place.
- Traffic light, car, people. Possible violation: the driver did not give way to pedestrians, or pedestrians cross the road on a prohibiting traffic light signal.
2. One object overlaps another. For example:
- One of the objects is road marking (a solid or double solid line), the other is a car. Possible violation: crossing a solid or double solid line.
- One of the objects is a designated cycle lane, the other is a car. Possible violation: driving a car in a designated cycle lane.

The application of these rules gave the expert an automatic hint when classifying the types of violations in images; a minimal sketch of such rules is given below.
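The paper does not give the rules' implementation, so the following is a hedged sketch of how co-occurrence and overlap rules of the kind described above could be expressed over detected bounding boxes in C++ with OpenCV types. All struct, function, and class-label names here are hypothetical:

```cpp
#include <opencv2/core.hpp>
#include <string>
#include <vector>

// Hypothetical detection record: class label plus bounding box.
struct Detection {
    std::string label;   // e.g. "car", "person", "crosswalk_sign"
    cv::Rect box;
};

// Rule class 2 from the text: one object overlapping another,
// here a car box intersecting a solid-line marking box.
static bool carCrossesSolidLine(const std::vector<Detection>& dets) {
    for (const Detection& a : dets) {
        if (a.label != "car") continue;
        for (const Detection& b : dets) {
            if (b.label != "solid_line") continue;
            if ((a.box & b.box).area() > 0)  // cv::Rect intersection
                return true;                 // hint: possible violation
        }
    }
    return false;
}

// Rule class 1: objects of different types in the same image segment,
// e.g. a crossing sign, a car, and a person all present together.
static bool carAtCrosswalkWithPedestrian(const std::vector<Detection>& dets) {
    bool sign = false, car = false, person = false;
    for (const Detection& d : dets) {
        if (d.label == "crosswalk_sign") sign = true;
        else if (d.label == "car") car = true;
        else if (d.label == "person") person = true;
    }
    return sign && car && person;  // hint: check right-of-way violation
}
```

Such predicates only produce hints for the expert, in line with the automated (rather than fully automatic) labeling approach adopted in the system.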
5. Discussion of Experimental Results

The experiments on automatic image labeling based on neural networks showed good results, but they cannot be considered ideal, even after additional training of the neural networks. For this reason, one cannot rely solely on automatic methods for high-quality image labeling: there is a high risk of missing essential details or, conversely, of detecting objects that are not actually there. This is especially important for the automatic detection of traffic violations. Therefore, involving experts in the image labeling procedure remains a necessary and important approach. Combining the two approaches in the developed intelligent system made it possible, on the one hand, to significantly reduce the time of primary image labeling due to the automatic detection of objects and classification of violation types. On the other hand, it gave the expert the opportunity to make changes manually and to choose different automatic detection methods (different neural networks, different modules) depending on the types of objects being detected, which significantly improved the quality of the resulting labeling.

6. Conclusion

The developed intelligent image labeling system for recognizing traffic violations allows the labeled images to be used for further analysis aimed at creating a comfortable urban environment. The use of machine learning methods for preparing and labeling images significantly reduced the operating time of the system as a whole and provided the experts involved in the analysis with a convenient, multifunctional tool. With the combination of manual and automatic image labeling, the ability to use already labeled images as an input parameter significantly reduced the number of object detection errors of the neural network. Equally important are the ability to indicate the type of traffic violation and the choice of the neural network used in the intelligent system, which also affected the speed of its operation and the number of errors in the resulting data. A promising way to improve the quality of automatic preliminary labeling of road objects is integration with information systems containing information about road signs and road markings. Possible directions for further development of the system are: creation of a universal platform suitable for labeling and classifying images for various tasks; creation of a multi-user system for parallelizing the labeling procedure; and support for cloud storage of images to organize centralized access for various experts.

7. References

[1] F. Mehboob. "Mathematical model based traffic violations identification." Computational and Mathematical Organization Theory, 2019, pp. 302-318. doi: 10.1007/s10588-018-9264-x.
[2] S. Asadianfam. "TVD-MRDL: traffic violation detection system using MapReduce-based deep learning for large-scale data." Multimedia Tools and Applications, 2021, pp. 2489-2516. doi: 10.1007/s11042-020-09714-8.
[3] Data Engineering, Preparation, and Labeling for AI 2019. URL: https://www.cloudfactory.com/reports/data-engineering-preparation-labeling-for-ai.
[4] 5 Approaches to Data Labeling for Machine Learning Projects. URL: https://lionbridge.ai/articles/5-approaches-to-data-labeling-for-machine-learning-projects/.
[5] A. Zakharova and D. Korostelyov. "Visual Classification of Data Sets with the Assistance of Experts in the Problems of Intelligent Agents Learning for Incompletely Automated Control Systems." 2019 Dynamics of Systems, Mechanisms and Machines (Dynamics), 2019, pp. 1-5. doi: 10.1109/Dynamics47113.2019.8944638.
[6] Image Data Labelling and Annotation - Everything you need to know. URL: https://www.xailient.com/post/image-data-labelling-and-annotation.
[7] Make Sense. URL: https://www.makesense.ai.
[8] GitHub - tzutalin/labelImg: LabelImg is a graphical image annotation tool and label object bounding boxes in images. URL: https://github.com/tzutalin/labelImg.
[9] VGG Image Annotator: a standalone image annotator application packaged as a single HTML file that runs on most modern web browsers. URL: https://gitlab.com/vgg/via.
[10] LabelMe, the open annotation tool. URL: http://labelme.csail.mit.edu/Release3.0/.
[11] Scalabel. URL: https://scalabel.ai/.
[12] RectLabel - Labeling images for bounding box object detection and segmentation. URL: https://rectlabel.com/.
[13] V7 - AI Data Platform for ML Teams. URL: https://www.v7labs.com/.
[14] Industry leading multi-sensor, LiDAR annotation and labelling tools. URL: https://www.deepen.ai/.
[15] Data Labeling Platform for Machine Learning - Heartex. URL: https://www.heartex.com/.
[16] Alegion | Data Labeling Software Platform. URL: https://www.alegion.com.
[17] Hasty.ai - A single application for all your vision AI needs. URL: https://www.hasty.ai.
[18] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks." IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, Vol. 39(6), pp. 1137-1149. doi: 10.1109/TPAMI.2016.2577031.
[19] N. Dalal and B. Triggs. "Histograms of Oriented Gradients for Human Detection." IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2005. doi: 10.1109/CVPR.2005.177.
[20] J. Dai, K. He, Y. Li, S. Ren, and J. Sun. "Instance-sensitive fully convolutional networks." European Conference on Computer Vision, 2016, pp. 534-549. doi: 10.1007/978-3-319-46466-4_32.
[21] J. Redmon and A. Farhadi. "YOLOv3: An Incremental Improvement." arXiv:1804.02767, 2018.